
python - How can I implement the RFE method on this dataset? - Stack Overflow


I am working on an RNA-Seq dataset with two CSV files:

  • data.csv (input), which contains gene expression features, gene_0 through gene_20530.

  • labels.csv (target), which contains the labels. It is a "fat data" dataset, meaning we have more features than observations.

I am trying to apply some feature selection method to reduce the number of features.

To achieve this, I apply two methods:

  1. SelectKBest with f_classif to keep the features with the strongest univariate association with the target.

  2. Recursive Feature Elimination (RFE) to retain the most relevant variables for my model.

Here is the implementation of the first method (SelectKBest with f_classif):

import pandas as pd  # For loading and merging the CSV files
from sklearn.feature_selection import SelectKBest, f_classif, RFE  # For selecting the best features
 
# Load the input data and the labels
data = pd.read_csv("data.csv")
labels = pd.read_csv("labels.csv")
 
# Merge data and labels based on the common column
data_labels = pd.merge(data, labels, on="Unnamed: 0")
 
# Now, extract X and y from the merged data
X = data_labels.drop(columns=['Class', 'Unnamed: 0'])  # Features
y = data_labels['Class']  # Target variable
 
selector = SelectKBest(f_classif, k=10)
X_new = selector.fit_transform(X, y)
 
selected_features = X.columns[selector.get_support()]
print(selected_features)

It worked well, and I got the following output:

Index(['gene_219', 'gene_220', 'gene_450', 'gene_3737', 'gene_7964', 'gene_9175', 'gene_9176', 'gene_13818', 'gene_14114', 'gene_18135'], dtype='object')

It is also very fast.

However, the second method, RFE, takes a lot of time, and I can't get any output from it. How can I optimize it?

Implementation:

from sklearn.ensemble import RandomForestClassifier
 
# RFE with RandomForestClassifier
estimator = RandomForestClassifier(random_state=42)
selector = RFE(estimator, n_features_to_select=10, step=1)  # Select top 10 features
X_rfe = selector.fit_transform(X, y)
 
# Get the selected features
selected_features_rfe = X.columns[selector.support_]
print(selected_features_rfe)
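
For context, one common way to make RFE tractable on a "fat" dataset like this is to prefilter cheaply, remove more than one feature per iteration, and rank with a lighter estimator. The sketch below is an illustration of that idea on synthetic data (the dataset shape, `SelectKBest` prefilter size, and `LogisticRegression` ranker are all assumptions, not part of the original question):

```python
# Sketch: speeding up RFE on a wide dataset, assuming:
# 1) a cheap SelectKBest prefilter cuts the feature count first,
# 2) step=0.1 lets RFE drop 10% of the remaining features per round,
# 3) a lighter estimator (LogisticRegression) does the ranking.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the RNA-Seq data: 200 samples, 2000 features.
X, y = make_classification(n_samples=200, n_features=2000,
                           n_informative=20, random_state=42)

pipe = Pipeline([
    # Univariate prefilter: keep the 200 best-scoring features.
    ("kbest", SelectKBest(f_classif, k=200)),
    # RFE on the reduced set, eliminating 10% of features per iteration.
    ("rfe", RFE(LogisticRegression(max_iter=1000),
                n_features_to_select=10, step=0.1)),
])
X_reduced = pipe.fit_transform(X, y)
print(X_reduced.shape)  # (200, 10)
```

Because RFE refits the estimator once per elimination round, cutting 20,530 features one at a time with a random forest means tens of thousands of forest fits; the prefilter and larger `step` reduce that to a handful of fits on a much smaller matrix.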