I am migrating my machine learning pipeline from VS Code on an Ubuntu VM to Databricks. When I run the same code on the same dataset in both environments, SequentialFeatureSelector selects different features, which leads to different final model outputs.
To debug, I have tried the following:
- Rounded X and y to 4 decimal places to check whether small differences in how the data is read were the cause.
- Set global seeds (np.random.seed(SEED), random.seed(SEED)) to control randomness.
- Explicitly set random_state=SEED in KFold inside SequentialFeatureSelector.
- Ran RidgeCV alone (without feature selection, both with and without StandardScaler()) and confirmed that it produces identical results on both machines.
- Ensured the Python and library versions are identical on both machines (see the version check after this list).
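For the version check, I compared the full environment report rather than just the interpreter and package versions, since I understand the underlying BLAS/OpenMP backend can also influence floating-point results:

```python
# Run on both machines and diff the output. sklearn.show_versions() reports
# Python, numpy, and scipy versions plus the BLAS/threading layer in use.
import numpy as np
import sklearn

print("numpy:", np.__version__)
print("scikit-learn:", sklearn.__version__)
sklearn.show_versions()  # includes dependency versions and the BLAS backend
```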
Observations:
- When I run only RidgeCV, I get identical results on both machines.
- When I run SequentialFeatureSelector, it picks different feature sets locally vs. on Databricks, causing different model outputs.
- I suspect there may be a randomness issue inside SFS or cross-validation that I haven’t accounted for.
Why does SequentialFeatureSelector give different results locally vs. on Databricks, despite using the same data and seed, and how can I fix it?
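To pin down where the two environments first diverge, one diagnostic I'm considering is to reproduce the first forward-selection round by hand and print every candidate's CV score at full precision; if two candidates score within ~1e-15 of each other, a near-tie could break differently across machines. A minimal sketch, assuming X is a pandas DataFrame and model_pipeline/SEED are defined as in the code below:

```python
# Diagnostic sketch: score each single-feature candidate exactly as the first
# SFS step would, and print the mean CV score at full precision so the two
# machines' outputs can be diffed line by line.
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=2, shuffle=True, random_state=SEED)
for col in X.columns:
    scores = cross_val_score(model_pipeline, X[[col]], y, scoring='r2', cv=cv)
    print(col, repr(scores.mean()))  # repr() keeps full float64 precision
```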
```python
import random

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Set a global seed
SEED = 42
np.random.seed(SEED)
random.seed(SEED)

# Ridge regression model
ridge_model = RidgeCV(
    alphas=np.logspace(-10, 2, 200),  # grid of regularization strengths
    fit_intercept=True,
    store_cv_values=False,
)

# Model pipeline: standardization + ridge regression
model_pipeline = make_pipeline(StandardScaler(), ridge_model)

# Sequential Feature Selection (SFS)
sfs = SequentialFeatureSelector(
    model_pipeline,
    n_features_to_select='auto',
    direction='forward',
    scoring='r2',
    cv=KFold(n_splits=2, shuffle=True, random_state=SEED),  # also tried shuffle=False
)

# Fit SFS
sfs.fit(X, y)

# Get selected features
predictors = sfs.get_feature_names_out().tolist()
```
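One mitigation I plan to try is forcing single-threaded execution of the native linear-algebra libraries, since I understand multi-threaded BLAS can change summation order and hence floating-point rounding between machines. A sketch using threadpoolctl (assuming it is installed in both environments):

```python
# Sketch: cap all native thread pools (BLAS, OpenMP) at one thread during the
# fit so reductions happen in a fixed order on both machines.
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1):
    sfs.fit(X, y)

predictors = sfs.get_feature_names_out().tolist()
```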