I'm running a random forest model and, to get some feature importances, I'm trying to run a SHAP analysis. The problem is that every time I try to plot the SHAP values, I get this error:
DimensionError: Length of features is not equal to the length of shap_values.
I don't know what's going on. When I run my XGBoost model everything works fine: I can see the SHAP plot for the data set. It's the exact same data set, but it just won't work with random forest. It's a binary classification problem.
Here is my Python code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
# Remove the primary key column 'id' from the features
features = result.drop(columns=['PQ2', 'id']) # Drop target and ID columns
target = result['PQ2'] # Target variable
# Split data into training and testing sets with 80-20 ratio
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
# Initialize Random Forest classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Fit the model on the training data
rf_model.fit(X_train, y_train)
# Make predictions
y_pred = rf_model.predict(X_test)
import shap
# Create a Tree SHAP explainer for the Random Forest model
explainer = shap.TreeExplainer(rf_model)
# Calculate SHAP values for the test set
shap_values = explainer.shap_values(X_test)
# Plot a SHAP summary plot
shap.summary_plot(shap_values, X_test, feature_names=features.columns)
# Plot a SHAP bar plot for global feature importance
shap.summary_plot(shap_values, X_test, feature_names=features.columns, plot_type="bar")
The shape of the test set is (829, 22), yet the SHAP values consistently come back as (22, 2) for random forest, and I don't know how to fix it. The data set has been preprocessed; the columns are either 0/1 indicators or numerical columns.
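In case it helps, here is a minimal NumPy-only sketch of what I think is happening with the shapes. It assumes (based on what I've read, not confirmed) that for a binary RandomForest, TreeExplainer returns one SHAP matrix per class, stacked along a trailing class axis, and that slicing out a single class restores the 2-D shape that summary_plot compares against X_test:

```python
import numpy as np

# Hypothetical stand-in for my data: 829 test rows, 22 features, 2 classes
n_samples, n_features, n_classes = 829, 22, 2
X_test_shape = (n_samples, n_features)

# Assumed layout of shap_values for a binary random forest:
# (n_samples, n_features, n_classes) rather than (n_samples, n_features)
shap_values_3d = np.zeros((n_samples, n_features, n_classes))

# The extra class axis would explain the length mismatch that
# summary_plot complains about. Selecting one class (here, the
# positive class) yields a 2-D array matching X_test again:
shap_values_class1 = shap_values_3d[:, :, 1]
assert shap_values_class1.shape == X_test_shape
```

Is this the right way to think about it, and is slicing one class the correct fix, or am I misusing the explainer?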