
python - Different Feature Selection Results Between Local (Ubuntu VM) and Databricks Using sklearn's SequentialFeatureS


I am migrating my machine learning pipeline from VS Code on an Ubuntu VM to Databricks. When I run the same code on the same dataset, SequentialFeatureSelector selects different features in the two environments, which leads to different final model outputs.

To debug, I have tried the following:

  • Rounded X and y to 4 decimal places to check if slight reading differences were the cause.
  • Set global seeds (np.random.seed(SEED), random.seed(SEED)) to control randomness.
  • Explicitly set random_state=SEED in KFold inside SequentialFeatureSelector.
  • Ran RidgeCV alone (without feature selection, with and without StandardScaler()) and confirmed that it produces the same results on both machines.
  • Ensured that the versions of Python and all libraries are the same.

Observations:

  • When I run only RidgeCV, I get identical results on both machines.
  • When I run SequentialFeatureSelector, it picks different feature sets on local vs. Databricks, causing different model outputs.
  • I suspect there may be a randomness issue inside SFS or cross-validation that I haven’t accounted for.
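One way to make the data and fold checks above exact, rather than relying on rounding, is to compare fingerprints of the inputs and of the cross-validation splits on both machines. The following is a minimal sketch, not part of the original pipeline: the helpers `fingerprint` and `fold_fingerprints` and the stand-in arrays `X_demo` and `y_demo` are hypothetical, and it assumes X and y are purely numeric.

```python
import hashlib

import numpy as np
from sklearn.model_selection import KFold

SEED = 42


def fingerprint(array):
    """Return a short SHA-256 digest of an array's shape and float64 bytes."""
    # Assumption: the input is numeric, so it can be cast to float64.
    arr = np.ascontiguousarray(array, dtype=np.float64)
    digest = hashlib.sha256()
    digest.update(str(arr.shape).encode())
    digest.update(arr.tobytes())
    return digest.hexdigest()[:16]


def fold_fingerprints(n_samples, n_splits=2, seed=SEED):
    """Fingerprint the test indices of each shuffled KFold split."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    placeholder = np.zeros((n_samples, 1))  # KFold only needs the sample count
    return [fingerprint(test_idx) for _, test_idx in cv.split(placeholder)]


# Hypothetical stand-in data; replace with the real X and y on each machine.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = rng.normal(size=100)

print("X:", fingerprint(X_demo))
print("y:", fingerprint(y_demo))
print("folds:", fold_fingerprints(len(X_demo)))
```

If the three printed values match between the local VM and Databricks, the inputs and the splits are bit-for-bit identical, and any remaining difference has to come from the scoring step itself.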

Why does SequentialFeatureSelector give different results on local vs. Databricks despite using the same data and seed, and how can I fix it?

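import random

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Set a global seed
SEED = 42
np.random.seed(SEED)
random.seed(SEED)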

# Ridge regression model
ridge_model = RidgeCV(
    alphas=np.logspace(-10, 2, 200),  # Alpha = regularization strength
    fit_intercept=True,  
    store_cv_values=False)

# Model pipeline: Standardization + Ridge Regression
model_pipeline = make_pipeline(StandardScaler(), ridge_model)

# Sequential Feature Selection (SFS)
sfs = SequentialFeatureSelector(
    model_pipeline,
    n_features_to_select='auto',
    direction='forward',
    scoring='r2',
    cv=KFold(n_splits=2, random_state=SEED, shuffle=True))  # Tried setting random_state explicitly and tried shuffle=False

# Fit SFS
sfs.fit(X, y)

# Get selected features
predictors = sfs.get_feature_names_out().tolist()
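
A further diagnostic that may help narrow this down: forward selection adds, at each step, the candidate feature with the best cross-validated score, so two candidates with nearly equal scores could in principle be ranked differently if the floating-point results differ slightly between environments. Whether that is what happens here is an assumption, not something the results above confirm. The sketch below prints the per-candidate scores for one step at full precision so they can be compared across machines. The helper `candidate_scores` and the synthetic `X_demo` and `y_demo` are hypothetical stand-ins, and `store_cv_values` is left out so the sketch does not depend on a particular scikit-learn version.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 42


def candidate_scores(estimator, X, y, selected, cv, scoring="r2"):
    """Score every unselected feature when added to the `selected` columns.

    Returns (feature_index, mean_cv_score) pairs, best score first.
    """
    X = np.asarray(X)
    scores = {}
    for idx in range(X.shape[1]):
        if idx in selected:
            continue
        cols = sorted(selected + [idx])  # keep the original column order
        scores[idx] = cross_val_score(
            estimator, X[:, cols], y, cv=cv, scoring=scoring
        ).mean()
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Synthetic stand-in data (hypothetical); replace with the real X and y.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, n_informative=3, noise=10.0, random_state=SEED
)

pipeline = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-10, 2, 200)))
cv = KFold(n_splits=2, shuffle=True, random_state=SEED)

# First forward step: nothing has been selected yet.
ranked = candidate_scores(pipeline, X_demo, y_demo, selected=[], cv=cv)
for idx, score in ranked:
    print(idx, repr(score))  # repr keeps the full floating-point precision

gap = ranked[0][1] - ranked[1][1]
print("gap between top two candidates:", repr(gap))
```

Running this on both machines and comparing the printed values shows whether the scores differ at all and at which digit. A very small gap between the top two candidates would make the choice at that step sensitive to such differences. To inspect later steps, pass the indices chosen so far as `selected`.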
