Below is a self-contained code example.
- I have test_data with character columns (name, id, gender)
- I convert them all to factors
- I mark name and id as "informational" (i.e. not to be used in model building)
- When I run tune_grid, it complains about name, id, and gender being character, even though 2 of them should be ignored, and all 3 are factors.
I want to keep all the columns around so I can debug issues, so I don't just want to drop them.
I also want to separate the data processing (with recipes) and the model training. I'll have lots of different preprocessing recipes, but I'll be building the same sort of model over and over.
Why is this happening?
Error:
→ A | error: ✖ The following variables have the wrong class:
• `name` must have class <factor>, not <character>.
• `id` must have class <factor>, not <character>.
• `gender` must have class <factor>, not <character>.
Debug info during the run:
> final_model <- train_lasso_model(recipe_obj, processed_data)
vfold [2 × 2] (S3: vfold_cv/rset/tbl_df/tbl/data.frame)
$ splits:List of 2
..$ :List of 4
.. ..$ data : tibble [3 × 6] (S3: tbl_df/tbl/data.frame)
.. .. ..$ name : Factor w/ 2 levels "sam","unknown": 1 1 1
.. .. ..$ id : Factor w/ 4 levels "1","2","3","unknown": 1 2 3
.. .. ..$ gender : Factor w/ 3 levels "female","male",..: 2 1 1
.. .. ..$ target : num [1:3] 4 5 6
.. .. ..$ gender_male : num [1:3] 1 0 0
.. .. ..$ gender_unknown: num [1:3] 0 0 0
.. ..$ in_id : int 3
.. ..$ out_id: logi NA
.. ..$ id : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
.. .. ..$ id: chr "Fold1"
.. ..- attr(*, "class")= chr [1:2] "vfold_split" "rsplit"
..$ :List of 4
.. ..$ data : tibble [3 × 6] (S3: tbl_df/tbl/data.frame)
.. .. ..$ name : Factor w/ 2 levels "sam","unknown": 1 1 1
.. .. ..$ id : Factor w/ 4 levels "1","2","3","unknown": 1 2 3
.. .. ..$ gender : Factor w/ 3 levels "female","male",..: 2 1 1
.. .. ..$ target : num [1:3] 4 5 6
.. .. ..$ gender_male : num [1:3] 1 0 0
.. .. ..$ gender_unknown: num [1:3] 0 0 0
.. ..$ in_id : int [1:2] 1 2
.. ..$ out_id: logi NA
.. ..$ id : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
.. .. ..$ id: chr "Fold2"
.. ..- attr(*, "class")= chr [1:2] "vfold_split" "rsplit"
$ id : chr [1:2] "Fold1" "Fold2"
- attr(*, "v")= num 2
- attr(*, "repeats")= num 1
- attr(*, "breaks")= num 4
- attr(*, "pool")= num 0.1
- attr(*, "fingerprint")= chr "2c80c86a0361fcf4a6d480eb1b0b8d79"
before tune_grid
→ A | error: ✖ The following variables have the wrong class:
• `name` must have class <factor>, not <character>.
• `id` must have class <factor>, not <character>.
• `gender` must have class <factor>, not <character>.
There were issues with some computations A: x2
after tune_grid
Error in `estimate_tune_results()`:
! All models failed. Run `show_notes(.Last.tune.result)` for more information.
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
All models failed. Run `show_notes(.Last.tune.result)` for more information.
> rlang::last_trace()
<error/rlang_error>
Error in `estimate_tune_results()`:
! All models failed. Run `show_notes(.Last.tune.result)` for more information.
---
Backtrace:
▆
1. ├─global train_lasso_model(recipe_obj, processed_data)
2. │ └─tune_results %>% select_best(metric = "roc_auc")
3. ├─tune::select_best(., metric = "roc_auc")
4. └─tune:::select_best.tune_results(., metric = "roc_auc")
5. ├─tune::show_best(...)
6. └─tune:::show_best.tune_results(...)
7. └─tune::.filter_perf_metrics(x, metric, eval_time)
8. └─tune::estimate_tune_results(x)
> train <- prepped_recipe %>% juice
> sapply(train[, info_vars], class)
name id gender
"factor" "factor" "factor"
> sapply(processed_data[, info_vars], class)
name id gender
"factor" "factor" "factor"
> class(processed_data)
[1] "tbl_df" "tbl" "data.frame"
> packageVersion("tune")
[1] ‘1.2.1’
Code:
library(recipes)
library(workflows)
# Also needed for the code below:
library(parsnip)   # logistic_reg(), set_engine()
library(rsample)   # vfold_cv()
library(tune)      # tune(), tune_grid(), select_best(), finalize_workflow()
train_lasso_model <- function(recipe_obj, processed_data,
grid_size = 10, folds = 2) {
# Create a logistic regression model specification with Lasso regularization
log_reg_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet")
# Create a workflow
workflow_obj <- workflow() %>%
add_recipe(recipe_obj) %>%
add_model(log_reg_spec)
# Set up cross-validation
cv_folds <- vfold_cv(processed_data, v = folds)
str(cv_folds)
# Tune the model to find the best regularization strength (penalty)
message("before tune_grid")
tune_results <- workflow_obj %>%
tune_grid(resamples = cv_folds, grid = grid_size)
message("after tune_grid")
# Check the best tuning parameters (lambda)
best_lambda <- tune_results %>%
select_best(metric = "roc_auc")
# Finalize the workflow with the best penalty
message("before finalize_workflow")
final_workflow <- workflow_obj %>%
finalize_workflow(best_lambda)
message("after finalize_workflow")
# Fit the final model
final_model <- fit(final_workflow, data = processed_data)
# Return the trained model
return(final_model)
}
test_data <- data.frame(
name = c("sam", "sam", "sam"),
id = c("1", "2", "3"),
gender = c("male", "female", "female"),
target = c(4, 5, 6)
)
info_vars <- c("name", "id",
# mark gender as informational, but still make it a dummy var
"gender")
recipe_obj <- recipe(target ~ ., data = test_data) %>%
# mark vars as not used in the model
update_role(
all_of(info_vars),
new_role = "informational") %>%
# Create an "unknown" category for all unknown factor levels
step_unknown(all_nominal(), skip = TRUE) %>%
# Convert factors/character columns to dummies
step_dummy(all_nominal(), -all_outcomes(), -all_of(info_vars),
gender,
keep_original_cols = TRUE)
prepped_recipe <- recipe_obj %>% prep(training = test_data)
processed_data <- prepped_recipe %>% bake(new_data = NULL)
final_model <- train_lasso_model(recipe_obj, processed_data)
train <- prepped_recipe %>% juice
sapply(train[, info_vars], class)
sapply(processed_data[, info_vars], class)
class(processed_data)
packageVersion("tune")
NOTE: I get the same error if I comment out step_unknown and step_dummy.
asked Feb 4 at 16:49 by dfrankow (edited Feb 4 at 23:41)

1 Answer
Thanks for the additional details and the reproducible example. There are two ways to go about this. First, you could make a copy of gender
and give it a non-predictor role. Alternatively, you can keep the original column (as you do) and remove it from the model when the model is added to the workflow. I took the latter approach.
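The first approach (a copied column) could be sketched roughly as below. This is my own illustration, not code from the answer: the `gender_copy` name is hypothetical, and it assumes `step_mutate()`'s `role` argument can demote the new column so it is carried along but never treated as a predictor.

```r
library(tidymodels)

# Sketch of the "copy" approach, using the same test_data as in the question:
# duplicate gender into a non-predictor column, then dummy-encode the
# original gender as a normal predictor. The copy survives for debugging.
recipe_copy <- recipe(target ~ ., data = test_data) %>%
  # gender_copy gets a non-predictor role at creation time
  step_mutate(gender_copy = gender, role = "informational") %>%
  update_role(name, id, new_role = "informational") %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())
```

With this, no special model formula is needed in `add_model()`, because `gender` itself keeps the predictor role.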
Some small comments:
- The outcome needs to be a factor for logistic regression, not a numeric column.
- We want resampling to account for everything that happens to the data. It is much better to put the recipe in a workflow and have it re-run within each resample. See Section 7.1 of the tidymodels book for an explanation.
Here’s the code that I would use:
library(tidymodels)
tidymodels_prefer()
test_data <- expand.grid(
name = c("sam", "sam", "sam"),
id = c("1", "2", "3"),
gender = c("male", "female", "female"),
# NOTE: Don't use `c(4, 5, 6)` as the outcome for logistic reg
target = c("yes", "no")
) %>%
# NOTE: make the categorical data into factors
mutate(across(where(is.character), ~ as.factor(.x)))
recipe_obj <-
recipe(target ~ ., data = test_data) %>%
# mark vars as not used in the model
update_role(name, id, new_role = "informational") %>%
# Create an "unknown" category for all unknown factor levels
# NOTE: capture all character/factor _predictors_ for transformations
step_unknown(all_nominal_predictors()) %>%
# Convert factors/character columns to dummies
step_dummy(all_nominal_predictors(), keep_original_cols = TRUE)
log_reg_spec <-
logistic_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet")
# Create a workflow
workflow_obj <- workflow() %>%
add_recipe(recipe_obj) %>%
# NOTE: specify what the model gets post-recipe.
# See https://www.tmwr.org/workflows#special-model-formulas
# This removes gender from the model (while keeping its indicator
# columns) but keeps the gender column in the data.
add_model(log_reg_spec, formula = target ~ . - gender)
grid_size <- 10
num_folds <- 2
cv_folds <- vfold_cv(test_data, v = num_folds)
tune_results <- workflow_obj %>%
tune_grid(resamples = cv_folds, grid = grid_size)
tune_results
#> # Tuning results
#> # 2-fold cross-validation
#> # A tibble: 2 × 4
#> splits id .metrics .notes
#> <list> <chr> <list> <list>
#> 1 <split [27/27]> Fold1 <tibble [30 × 5]> <tibble [0 × 3]>
#> 2 <split [27/27]> Fold2 <tibble [30 × 5]> <tibble [0 × 3]>
Created on 2025-02-04 with reprex v2.1.0
Comment from topepo (Feb 4 at 18:02): You ask for gender not to be treated as a predictor, then treat it as one. The roles are discordant and that is probably what the issue is. If you want to retain gender, there are other ways of doing that. Can you give us a sense of what you want to do? Your selectors could also help. Using all_nominal_predictors() might be a good idea, but I would need more context to say for sure.