
python - Why does SequentialFeatureSelector return at most "n_features_in_ - 1" predictors? - Stack Overflow


I have a training dataset with six features and I am using SequentialFeatureSelector to find an "optimal" subset of the features for a linear regression model. The following code returns three features, which I will call X1, X2, X3.

sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select='auto', 
                                tol=0.05, direction='forward', 
                                scoring='neg_root_mean_squared_error', cv=8)
sfs.fit_transform(X_train, y_train)

To check the results, I decided to run the same code using the subset of features X1, X2, X3 instead of X_train. I was expecting to see the features X1, X2, X3 returned again, but instead it was only the features X1, X2. Similarly, using these two features again in the same code returned only X1. It seems that the behavior of sfs is always to return a proper subset of the input features with at most n_features_in_ - 1 columns, but I cannot seem to find this information in the scikit-learn docs. Is this correct, and if so, what is the reasoning for not allowing sfs to return the full set of features?

I also checked to see if using backward selection would return a full feature set.

sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select='auto', 
                                tol=1000, direction='backward', 
                                scoring='neg_root_mean_squared_error', cv=8)
sfs.fit_transform(X_train, y_train)

I set the threshold tol to be a large value in the hope that there would be no satisfactory improvement from the full set of features of X_train. But, instead of returning the six original features, it only returned five. The docs simply state

If the score is not incremented by at least tol between two consecutive feature additions or removals, stop adding or removing.

So it seems that the full feature set is not being considered during cross-validation, and the behavior of sfs is different at the very end of a forward selection or at the very beginning of a backwards selection. If the full set of features outperforms any proper subset of the features, then don't we want sfs to return that possibility? Is there a standard method to compare a selected proper subset of the features and the full set of features using cross-validation?


1 Answer

Check the source code, lines 240-246 inside the fit() method:

if self.n_features_to_select == "auto":
    if self.tol is not None:
        # With auto feature selection, `n_features_to_select_` will be updated
        # to `support_.sum()` after features are selected.
        self.n_features_to_select_ = n_features - 1
    else:
        self.n_features_to_select_ = n_features // 2

As can be seen, even in auto selection mode with a given tol, the maximum number of features that can be added is capped at n_features - 1 for some reason (perhaps worth reporting as an issue on GitHub).
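To see the cap in action, here is a minimal sketch (the make_regression dataset and the tiny tol are illustrative assumptions, not the OP's setup): even when every candidate feature improves the score by more than tol, the selector stops at n_features - 1.

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# toy regression data with 5 informative features (assumption for illustration)
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=0)

sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select='auto',
                                tol=1e-6, direction='forward',
                                scoring='neg_root_mean_squared_error', cv=8)
sfs.fit(X, y)
print(sfs.get_support().sum())  # prints at most 4 (n_features - 1), never all 5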

We can work around this by defining a function get_best_new_feature_score() (modeled on the private method _get_best_new_feature_score() from the source code), as shown below:

import numpy as np

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score

def get_best_new_feature_score(estimator, X, y, cv, current_mask, direction, scoring):
    # evaluate every feature not yet in the mask and return the best one with its CV score
    candidate_feature_indices = np.flatnonzero(~current_mask)
    scores = {}
    for feature_idx in candidate_feature_indices:
        candidate_mask = current_mask.copy()
        candidate_mask[feature_idx] = True
        if direction == "backward":
            candidate_mask = ~candidate_mask
        X_new = X[:, candidate_mask]
        scores[feature_idx] = cross_val_score(
            estimator,
            X_new,
            y,
            cv=cv,
            scoring=scoring
        ).mean()
    new_feature_idx = max(scores, key=lambda feature_idx: scores[feature_idx])
    return new_feature_idx, scores[new_feature_idx]

Now, let's implement 'auto' (forward) selection using a regression dataset with 5 features: add the features one by one, reporting the improvement in score at each step and stopping once the improvement no longer exceeds the provided tol:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_features=5) # data to be used
X.shape 
# (100, 5)
lm = LinearRegression() # model to be used

# now implement 'auto' feature selection (forward selection)   
cur_mask = np.zeros(X.shape[1]).astype(bool) # no feature selected initially
cv, direction, scoring = 8, 'forward', 'neg_root_mean_squared_error'
tol = 1 # if score improvement > tol, feature will be added in forward selection
old_score = -np.inf
ids, scores = [], []
for i in range(X.shape[1]):
    idx, new_score = get_best_new_feature_score(lm, X, y, current_mask=cur_mask, cv=cv, direction=direction, scoring=scoring)
    print(new_score - old_score, tol, new_score - old_score > tol)
    if (new_score - old_score) > tol:
        cur_mask[idx] = True
        ids.append(idx)
        scores.append(new_score)
        old_score = new_score
        print(f'feature {idx} added, CV score {new_score}, mask {cur_mask}')
    else:
        break  # improvement does not exceed tol, stop adding features

# feature 3 added, CV score -90.66899644023539, mask [False False False  True False]
# feature 1 added, CV score -59.21188041830155, mask [False  True False  True False]
# feature 2 added, CV score -16.709218665372905, mask [False  True  True  True False]
# feature 4 added, CV score -3.1862116620446166, mask [False  True  True  True  True]
# feature 0 added, CV score -1.4011801838814216e-13, mask [ True  True  True  True  True]

If tol is set to 10 instead, only 4 features will be added in forward selection. Similarly, with tol=20, only 3 features are added, as expected.
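Finally, to address the last part of the question, one straightforward way to compare the selected subset against the full feature set is to score both with cross_val_score and keep whichever does better. A minimal sketch, reusing X, y, lm, cur_mask, tol and the CV settings defined above:

full_score = cross_val_score(lm, X, y, cv=cv, scoring=scoring).mean()
subset_score = cross_val_score(lm, X[:, cur_mask], y, cv=cv, scoring=scoring).mean()
print(f'selected subset: {subset_score}, full feature set: {full_score}')
# keep the full feature set only if it beats the selected subset by more than tol
use_full_set = (full_score - subset_score) > tol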
