I am migrating my machine learning pipeline from VS Code on an Ubuntu VM to Databricks. When I run the same code on the same dataset in both environments, SequentialFeatureSelector selects different features, which leads to different final model outputs.
To debug, I have tried the following:
- Rounded X and y to 4 decimal places to check whether small differences in how the data is read were the cause.
- Set global seeds (np.random.seed(SEED), random.seed(SEED)) to control randomness.
- Explicitly set random_state=SEED in KFold inside SequentialFeatureSelector.
- Ran RidgeCV alone (without feature selection, both with and without StandardScaler()) and confirmed that it produces identical results on both machines.
- Ensured the Python and library versions are identical on both machines (see the version check after this list).
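For the version check, I compared the full environment report rather than just the interpreter and package versions, since I understand the underlying BLAS/OpenMP backend can also influence floating-point results:

```python
# Run on both machines and diff the output. sklearn.show_versions() reports
# Python, numpy, and scipy versions plus the BLAS/threading layer in use.
import numpy as np
import sklearn

print("numpy:", np.__version__)
print("scikit-learn:", sklearn.__version__)
sklearn.show_versions()  # includes dependency versions and the BLAS backend
```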
Observations:
- When I run only RidgeCV, I get identical results on both machines.
- When I run SequentialFeatureSelector, it picks different feature sets locally vs. on Databricks, causing different model outputs.
- I suspect there may be a randomness issue inside SFS or cross-validation that I haven’t accounted for.
Why does SequentialFeatureSelector give different results locally vs. on Databricks, despite using the same data and seed, and how can I fix it?
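To pin down where the two environments first diverge, one diagnostic I'm considering is to reproduce the first forward-selection round by hand and print every candidate's CV score at full precision; if two candidates score within ~1e-15 of each other, a near-tie could break differently across machines. A minimal sketch, assuming X is a pandas DataFrame and model_pipeline/SEED are defined as in the code below:

```python
# Diagnostic sketch: score each single-feature candidate exactly as the first
# SFS step would, and print the mean CV score at full precision so the two
# machines' outputs can be diffed line by line.
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=2, shuffle=True, random_state=SEED)
for col in X.columns:
    scores = cross_val_score(model_pipeline, X[[col]], y, scoring='r2', cv=cv)
    print(col, repr(scores.mean()))  # repr() keeps full float64 precision
```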
```python
import random

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Set a global seed
SEED = 42
np.random.seed(SEED)
random.seed(SEED)

# Ridge regression model
ridge_model = RidgeCV(
    alphas=np.logspace(-10, 2, 200),  # grid of regularization strengths
    fit_intercept=True,
    store_cv_values=False,
)

# Model pipeline: standardization + ridge regression
model_pipeline = make_pipeline(StandardScaler(), ridge_model)

# Sequential Feature Selection (SFS)
sfs = SequentialFeatureSelector(
    model_pipeline,
    n_features_to_select='auto',
    direction='forward',
    scoring='r2',
    cv=KFold(n_splits=2, shuffle=True, random_state=SEED),  # also tried shuffle=False
)

# Fit SFS
sfs.fit(X, y)

# Get selected features
predictors = sfs.get_feature_names_out().tolist()
```
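One mitigation I plan to try is forcing single-threaded execution of the native linear-algebra libraries, since I understand multi-threaded BLAS can change summation order and hence floating-point rounding between machines. A sketch using threadpoolctl (assuming it is installed in both environments):

```python
# Sketch: cap all native thread pools (BLAS, OpenMP) at one thread during the
# fit so reductions happen in a fixed order on both machines.
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1):
    sfs.fit(X, y)

predictors = sfs.get_feature_names_out().tolist()
```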