python - Length of features is not equal to the length of SHAP Values - Stack Overflow

I'm running a random forest model and, to get some feature importances, I'm trying to run a SHAP analysis. The problem is that every time I try to plot the SHAP values, I get this error:

DimensionError: Length of features is not equal to the length of shap_values. 

I don't know what's going on. When I run my XGBoost model, everything works fine and I can see the SHAP plot for the dataset. It's the exact same dataset, but it just won't run with random forest. This is a binary classification problem.

Here is my Python code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Remove the primary key column 'id' from the features
features = result.drop(columns=['PQ2', 'id'])  # Drop target and ID columns
features_names = features.columns.tolist()  # Feature names used in the SHAP plots below
target = result['PQ2']  # Target variable

# Split data into training and testing sets with 80-20 ratio
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
 
# Initialize Random Forest classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the model on the training data
rf_model.fit(X_train, y_train)

# Make predictions
y_pred = rf_model.predict(X_test)

import shap

# Create a Tree SHAP explainer for the Random Forest model
explainer = shap.TreeExplainer(rf_model)

# Calculate SHAP values for the test set
shap_values = explainer.shap_values(X_test)

# Plot a SHAP summary plot
shap.summary_plot(shap_values, X_test, feature_names=features_names)

# Plot a SHAP bar plot for global feature importance
shap.summary_plot(shap_values, X_test, feature_names=features_names, plot_type="bar")

The shape of the test set is (829, 22), yet the SHAP values for random forest consistently come back with shape (22, 2), and I don't know how to fix it. The dataset has been preprocessed; every column is either binary (0/1) or numeric.
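A shape of (22, 2) for a single row suggests the SHAP values carry an extra per-class axis. For a scikit-learn classifier, `TreeExplainer.shap_values` returns either a list with one (n_samples, n_features) array per class (older shap releases) or a single (n_samples, n_features, n_classes) array (newer releases). XGBoost's binary model explains one margin output, so it gives a plain 2-D array. The sketch below is one possible way to pick a single class before plotting. The helper name `select_class_shap_values` is hypothetical and not part of shap, and the zero-filled array only stands in for real SHAP output with the shapes from the question.

```python
import numpy as np


def select_class_shap_values(shap_values, n_features, class_index=1):
    """Return a 2-D (n_samples, n_features) array of SHAP values for one class.

    Hypothetical helper (not part of shap). Assumes the input is either a list
    of per-class 2-D arrays or a single 2-D / 3-D array.
    """
    if isinstance(shap_values, list):
        # Older shap releases: one (n_samples, n_features) array per class.
        values = np.asarray(shap_values[class_index])
    else:
        values = np.asarray(shap_values)
        if values.ndim == 3:
            # Newer shap releases: (n_samples, n_features, n_classes).
            values = values[:, :, class_index]

    # Guard against passing something that still cannot match the feature matrix.
    if values.ndim != 2 or values.shape[1] != n_features:
        raise ValueError(f"unexpected SHAP shape {values.shape}")
    return values


# Stand-in data with the shapes from the question: 829 rows, 22 features, 2 classes.
fake_shap_values = np.zeros((829, 22, 2))
print(select_class_shap_values(fake_shap_values, n_features=22).shape)
```

With the real objects from the question, the call would look something like `shap.summary_plot(select_class_shap_values(shap_values, X_test.shape[1]), X_test, feature_names=features_names)`. Class index 1 is assumed here to be the positive class, which depends on the order in `rf_model.classes_`.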
