Below is a self-contained code example.
- I have test_data with character columns (name, id, gender)
- I convert them all to factors
- I mark name and id as "informational" (i.e. not to be used in model building)
- When I run tune_grid, it complains about name, id, and gender being character, even though 2 of them should be ignored, and all 3 are factors.
I want to keep all the columns around so I can debug issues, so I don't just want to drop them.
I also want to separate the data processing (with recipes) and the model training. I'll have lots of different preprocessing recipes, but I'll be building the same sort of model over and over.
Why is this happening?
Error:
→ A | error: ✖ The following variables have the wrong class:
• `name` must have class <factor>, not <character>.
• `id` must have class <factor>, not <character>.
• `gender` must have class <factor>, not <character>.
Debug info during the run:
> final_model <- train_lasso_model(recipe_obj, processed_data)
vfold [2 × 2] (S3: vfold_cv/rset/tbl_df/tbl/data.frame)
$ splits:List of 2
..$ :List of 4
.. ..$ data : tibble [3 × 6] (S3: tbl_df/tbl/data.frame)
.. .. ..$ name : Factor w/ 2 levels "sam","unknown": 1 1 1
.. .. ..$ id : Factor w/ 4 levels "1","2","3","unknown": 1 2 3
.. .. ..$ gender : Factor w/ 3 levels "female","male",..: 2 1 1
.. .. ..$ target : num [1:3] 4 5 6
.. .. ..$ gender_male : num [1:3] 1 0 0
.. .. ..$ gender_unknown: num [1:3] 0 0 0
.. ..$ in_id : int 3
.. ..$ out_id: logi NA
.. ..$ id : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
.. .. ..$ id: chr "Fold1"
.. ..- attr(*, "class")= chr [1:2] "vfold_split" "rsplit"
..$ :List of 4
.. ..$ data : tibble [3 × 6] (S3: tbl_df/tbl/data.frame)
.. .. ..$ name : Factor w/ 2 levels "sam","unknown": 1 1 1
.. .. ..$ id : Factor w/ 4 levels "1","2","3","unknown": 1 2 3
.. .. ..$ gender : Factor w/ 3 levels "female","male",..: 2 1 1
.. .. ..$ target : num [1:3] 4 5 6
.. .. ..$ gender_male : num [1:3] 1 0 0
.. .. ..$ gender_unknown: num [1:3] 0 0 0
.. ..$ in_id : int [1:2] 1 2
.. ..$ out_id: logi NA
.. ..$ id : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
.. .. ..$ id: chr "Fold2"
.. ..- attr(*, "class")= chr [1:2] "vfold_split" "rsplit"
$ id : chr [1:2] "Fold1" "Fold2"
- attr(*, "v")= num 2
- attr(*, "repeats")= num 1
- attr(*, "breaks")= num 4
- attr(*, "pool")= num 0.1
- attr(*, "fingerprint")= chr "2c80c86a0361fcf4a6d480eb1b0b8d79"
before tune_grid
→ A | error: ✖ The following variables have the wrong class:
• `name` must have class <factor>, not <character>.
• `id` must have class <factor>, not <character>.
• `gender` must have class <factor>, not <character>.
There were issues with some computations A: x2
after tune_grid
Error in `estimate_tune_results()`:
! All models failed. Run `show_notes(.Last.tune.result)` for more information.
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
All models failed. Run `show_notes(.Last.tune.result)` for more information.
> rlang::last_trace()
<error/rlang_error>
Error in `estimate_tune_results()`:
! All models failed. Run `show_notes(.Last.tune.result)` for more information.
---
Backtrace:
▆
1. ├─global train_lasso_model(recipe_obj, processed_data)
2. │ └─tune_results %>% select_best(metric = "roc_auc")
3. ├─tune::select_best(., metric = "roc_auc")
4. └─tune:::select_best.tune_results(., metric = "roc_auc")
5. ├─tune::show_best(...)
6. └─tune:::show_best.tune_results(...)
7. └─tune::.filter_perf_metrics(x, metric, eval_time)
8. └─tune::estimate_tune_results(x)
> train <- prepped_recipe %>% juice
> sapply(train[, info_vars], class)
name id gender
"factor" "factor" "factor"
> sapply(processed_data[, info_vars], class)
name id gender
"factor" "factor" "factor"
> class(processed_data)
[1] "tbl_df" "tbl" "data.frame"
> packageVersion("tune")
[1] ‘1.2.1’
Code:
library(recipes)
library(workflows)
# Also needed for the code below:
library(parsnip)   # logistic_reg(), set_engine()
library(rsample)   # vfold_cv()
library(tune)      # tune(), tune_grid(), select_best(), finalize_workflow()
train_lasso_model <- function(recipe_obj, processed_data,
grid_size = 10, folds = 2) {
# Create a logistic regression model specification with Lasso regularization
log_reg_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet")
# Create a workflow
workflow_obj <- workflow() %>%
add_recipe(recipe_obj) %>%
add_model(log_reg_spec)
# Set up cross-validation
cv_folds <- vfold_cv(processed_data, v = folds)
str(cv_folds)
# Tune the model to find the best regularization strength (penalty)
message("before tune_grid")
tune_results <- workflow_obj %>%
tune_grid(resamples = cv_folds, grid = grid_size)
message("after tune_grid")
# Check the best tuning parameters (lambda)
best_lambda <- tune_results %>%
select_best(metric = "roc_auc")
# Finalize the workflow with the best penalty
message("before finalize_workflow")
final_workflow <- workflow_obj %>%
finalize_workflow(best_lambda)
message("after finalize_workflow")
# Fit the final model
final_model <- fit(final_workflow, data = processed_data)
# Return the trained model
return(final_model)
}
test_data <- data.frame(
name = c("sam", "sam", "sam"),
id = c("1", "2", "3"),
gender = c("male", "female", "female"),
target = c(4, 5, 6)
)
info_vars <- c("name", "id",
# mark gender as informational, but still make it a dummy var
"gender")
recipe_obj <- recipe(target ~ ., data = test_data) %>%
# mark vars as not used in the model
update_role(
all_of(info_vars),
new_role = "informational") %>%
# Create an "unknown" category for all unknown factor levels
step_unknown(all_nominal(), skip = TRUE) %>%
# Convert factors/character columns to dummies
step_dummy(all_nominal(), -all_outcomes(), -all_of(info_vars),
gender,
keep_original_cols = TRUE)
prepped_recipe <- recipe_obj %>% prep(training = test_data)
processed_data <- prepped_recipe %>% bake(new_data = NULL)
final_model <- train_lasso_model(recipe_obj, processed_data)
train <- prepped_recipe %>% juice
sapply(train[, info_vars], class)
sapply(processed_data[, info_vars], class)
class(processed_data)
packageVersion("tune")
NOTE: I get the same error if I comment out step_unknown and step_dummy.
asked Feb 4 at 16:49 by dfrankow (edited Feb 4 at 23:41)

1 Answer
Thanks for the additional details and the reproducible example. There are two ways to go about this. First, you could make a copy of gender
and give it a non-predictor role. Alternatively, you can keep the original column (as you do) and remove it from the model when the model is added to the workflow. I took the latter approach.
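The first approach (a copied column) could be sketched roughly as below. This is my own illustration, not code from the answer: the `gender_copy` name is hypothetical, and it assumes `step_mutate()`'s `role` argument can demote the new column so it is carried along but never treated as a predictor.

```r
library(tidymodels)

# Sketch of the "copy" approach, using the same test_data as in the question:
# duplicate gender into a non-predictor column, then dummy-encode the
# original gender as a normal predictor. The copy survives for debugging.
recipe_copy <- recipe(target ~ ., data = test_data) %>%
  # gender_copy gets a non-predictor role at creation time
  step_mutate(gender_copy = gender, role = "informational") %>%
  update_role(name, id, new_role = "informational") %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())
```

With this, no special model formula is needed in `add_model()`, because `gender` itself keeps the predictor role.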
Some small comments:
- The outcome needs to be a factor for logistic regression, not a numeric column.
- We want resampling to account for everything that happens to the data. It is much better to put the recipe in a workflow and have it re-run within each resample. See Section 7.1 of the tidymodels book for an explanation.
Here’s the code that I would use:
library(tidymodels)
tidymodels_prefer()
test_data <- expand.grid(
name = c("sam", "sam", "sam"),
id = c("1", "2", "3"),
gender = c("male", "female", "female"),
# NOTE: Don't use `c(4, 5, 6)` as the outcome for logistic reg
target = c("yes", "no")
) %>%
# NOTE: make the categorical data into factors
mutate(across(where(is.character), ~ as.factor(.x)))
recipe_obj <-
recipe(target ~ ., data = test_data) %>%
# mark vars as not used in the model
update_role(name, id, new_role = "informational") %>%
# Create an "unknown" category for all unknown factor levels
# NOTE: capture all character/factor _predictors_ for transformations
step_unknown(all_nominal_predictors()) %>%
# Convert factors/character columns to dummies
step_dummy(all_nominal_predictors(), keep_original_cols = TRUE)
log_reg_spec <-
logistic_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet")
# Create a workflow
workflow_obj <- workflow() %>%
add_recipe(recipe_obj) %>%
# NOTE: specify what the model gets post-recipe.
# See https://www.tmwr.org/workflows#special-model-formulas
# This removes gender from the model (while keeping its indicator
# columns) but keeps the gender column in the data.
add_model(log_reg_spec, formula = target ~ . - gender)
grid_size <- 10
num_folds <- 2
cv_folds <- vfold_cv(test_data, v = num_folds)
tune_results <- workflow_obj %>%
tune_grid(resamples = cv_folds, grid = grid_size)
tune_results
#> # Tuning results
#> # 2-fold cross-validation
#> # A tibble: 2 × 4
#> splits id .metrics .notes
#> <list> <chr> <list> <list>
#> 1 <split [27/27]> Fold1 <tibble [30 × 5]> <tibble [0 × 3]>
#> 2 <split [27/27]> Fold2 <tibble [30 × 5]> <tibble [0 × 3]>
Created on 2025-02-04 with reprex v2.1.0
Comment from topepo (Feb 4 at 18:02): You ask for gender not to be treated as a predictor, then treat it as one. The roles are discordant and that is probably what the issue is. If you want to retain gender, there are other ways of doing that. Can you give us a sense of what you want to do? Your selectors could also help. Using all_nominal_predictors() might be a good idea, but I would need more context to say for sure.