I am working on an RNA-Seq dataset with two CSV files:
- data.csv (input), which contains the gene expression features (genes ranging from 0 to 20,530).
- labels.csv (target), which contains the class labels.
It is a "fat data" dataset, meaning we have more features than observations.
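To make the "fat data" point concrete, here is a quick shape check on a toy stand-in (the shapes here are made up for illustration; my real files follow the same pattern):

```python
import numpy as np
import pandas as pd

# Toy stand-in with the same shape pattern: 5 samples, 20 gene columns
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((5, 20)),
                    columns=[f"gene_{i}" for i in range(20)])

print(data.shape)                     # (5, 20)
print(data.shape[1] > data.shape[0])  # True -> more features than observations
```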
I am trying to apply feature selection methods to reduce the number of features. To achieve this, I apply two methods:
- SelectKBest with f_classif to keep the features with the highest ANOVA F-scores.
- Recursive Feature Elimination (RFE) to retain the most relevant variables for my model.
Here is my code for the first method (SelectKBest with f_classif):
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif, RFE  # To select the best features

# Load the two CSV files
data = pd.read_csv("data.csv")
labels = pd.read_csv("labels.csv")

# Merge data and labels on the common index column
data_labels = pd.merge(data, labels, on="Unnamed: 0")
# Now, extract X and y from the merged data
X = data_labels.drop(columns=['Class', 'Unnamed: 0']) # Features
y = data_labels['Class'] # Target variable
selector = SelectKBest(f_classif, k=10)
X_new = selector.fit_transform(X,y)
selected_features = X.columns[selector.get_support()]
print(selected_features)
It worked well, I got the following output:
Index(['gene_219', 'gene_220', 'gene_450', 'gene_3737', 'gene_7964', 'gene_9175', 'gene_9176', 'gene_13818', 'gene_14114', 'gene_18135'], dtype='object')
It is also very fast.
However, the second method, RFE, takes a very long time and I never get any output from it. How can I optimize it?
Implementation:
from sklearn.ensemble import RandomForestClassifier

# RFE with RandomForestClassifier as the estimator
estimator = RandomForestClassifier(random_state=42)
selector = RFE(estimator, n_features_to_select=10, step=1)  # Select the top 10 features
X_rfe = selector.fit_transform(X, y)
# Get the names of the selected features
selected_features_rfe = X.columns[selector.support_]
print(selected_features_rfe)
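One direction I am considering, though I am not sure it is the right approach: pre-filter with SelectKBest so RFE starts from far fewer features, use a larger step so several features are dropped per iteration instead of one, and make the forest lighter. A minimal sketch on synthetic data (the k=200, step=0.1, and n_estimators=50 values are assumptions, not tuned):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real data: more features than samples ("fat data")
X_demo, y_demo = make_classification(n_samples=100, n_features=2000,
                                     n_informative=10, random_state=42)

pipe = Pipeline([
    # Cheap univariate pre-filter: keep the 200 best features by ANOVA F-score
    ("kbest", SelectKBest(f_classif, k=200)),
    # RFE now only eliminates from 200 features; step=0.1 drops 10% of the
    # remaining features per iteration instead of refitting after each single one
    ("rfe", RFE(RandomForestClassifier(n_estimators=50, n_jobs=-1,
                                       random_state=42),
                n_features_to_select=10, step=0.1)),
])

X_reduced = pipe.fit_transform(X_demo, y_demo)
print(X_reduced.shape)  # (100, 10)
```

Would something like this be a valid way to speed up RFE, or does the pre-filter defeat the purpose of the recursive elimination?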