
python - Kernel-stop error occurred during shap-value calculation - Stack Overflow


I am trying to predict the occurrence of a certain phenomenon from about 9,500 persons' data with 36-37 predictors, using scikit-learn. It is a binary classification problem (occurs or does not occur). Ultimately, I want to predict whether the phenomenon will happen to a given person from these 36-37 predictors.

The data was split into 80% training data and 20% test data, a random forest model was trained and evaluated on the test data, and the hyperparameters were tuned with GridSearchCV. Finally, when I tried to plot the effect of each feature on the prediction using Shapley values, the program suddenly stopped during the SHAP value calculation with the message "Kernel stopped. Rebooting...". I used TreeSHAP (shap.TreeExplainer).

I added exception handling and investigated in debug mode with Spyder, but I could not find any detail about the kernel stop other than that it happened inside the shap library's value computation. The same kernel stop also occurred in JupyterLab.

Initially, when running the program with all 37 predictors, the SHAP value calculation and the subsequent plots worked without any problems: I could produce the SHAP summary, beeswarm, force, waterfall, and dependence plots. So it does not seem to be a bug in the program or a lack of memory. However, when I dropped one predictor to get 36 items (regardless of which predictor I drop), the above kernel stop error started to occur.

How can I resolve this error?

  • Only one predictor is a quantitative (numeric) variable; the rest are all categorical (nominal/ordinal) variables.
  • Python (3.11.11), scikit-learn (1.5.1), and shap (0.46.0) were installed with Anaconda 3.
  • The OS is Windows 11.
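For completeness, the exact installed versions can be printed from the environment itself. Since shap relies on numba for its compiled routines, and a shap/numba/numpy version mismatch is a common cause of hard kernel crashes, numba is worth checking too. A small sketch:

```python
# Print the versions of the packages involved in the crash; version
# mismatches between shap, numba, and numpy are a known source of
# hard (non-Python) crashes.
import sys
from importlib import metadata

print("Python:", sys.version.split()[0])
for pkg in ("scikit-learn", "shap", "numpy", "numba"):
    try:
        print(f"{pkg}: {metadata.version(pkg)}")
    except metadata.PackageNotFoundError:
        print(f"{pkg}: not installed")
```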
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

Length of X_train: 7516
Length of X_test: 1880
# Preprocessing pipeline
from sklearn.compose import make_column_transformer

num_tuple = (scaler, num_clms)  # the numeric column
cat_tuple = (onehot, cat_clms)  # the categorical columns
# Preprocessor object
preprocessor = make_column_transformer(num_tuple,
                                       cat_tuple,
                                       sparse_threshold=0,
                                       verbose_feature_names_out=False)
#Baseline Models
#Random Forest
## Instantiate and fit random forest model
rf = RandomForestClassifier()
rf_pipe = make_pipeline(preprocessor, rf)
rf_pipe.fit(X_train, y_train)
#evaluate classification of RandomForestClassifier

#Hyperparameter Tuning with GridSearch CV
#Random Forest 
## Get model parameters
rf_pipe.get_params()
## Parameters to be tested
param_grid = {'randomforestclassifier__class_weight': ['balanced'],
              'randomforestclassifier__max_depth': [10, 100, 1000],
              'randomforestclassifier__max_leaf_nodes': [10, 100, 1000],
              'randomforestclassifier__n_estimators': [10, 100]} 

## Fit and evaluate
rf_gs = GridSearchCV(rf_pipe, param_grid, verbose=2)
rf_gs.fit(X_train, y_train)
rf_gs.best_params_
## Fit best estimator
best_rf_gs = rf_gs.best_estimator_
best_rf_gs.fit(X_train, y_train)

## Evaluate tuned model
#evaluate classification of "RandomForest best estimator"

best_rf_gs

## Instantiate and fit random forest model
rf = RandomForestClassifier(class_weight='balanced', max_depth=1000, 
                            n_estimators=10, max_leaf_nodes=10
                            )
rf_pipe = make_pipeline(preprocessor, rf)
rf_pipe.fit(X_train, y_train)
## Evaluate tuned model
#evaluate classification of "RandomForest Evaluate tuned model"
# Define models
models = {
    'Random Forest': RandomForestClassifier(class_weight='balanced', 
                                            max_depth=1000,
                                            n_estimators=10, max_leaf_nodes=10
                                            )
}

# Assess models
#Assess models with preprocessor and display the report
#Final Model Explanation
# Access the RandomForestClassifier from the pipeline

rf_classifier = best_rf_gs.named_steps['randomforestclassifier']

# Create a SHAP explainer
explainer = shap.TreeExplainer(rf_classifier) 
try:
    shap_values = explainer(X_test, check_additivity=False)  # Kernel stop error here
except Exception:
    traceback.print_exc()

print(explainer)
type(shap_values)

print(np.shape(shap_values))

# SHAP bar plot with feature clustering
clustering = shap.utils.hclust(X_test, y_test)
shap.plots.bar(shap_values[:, :, 0], max_display=40,
               clustering=clustering, clustering_cutoff=0.5)

Besides the above, I searched the web and tried the following measures, but none of them led to any improvement.

  1. I introduced exception handling, but could not obtain any information about when or where the problem occurred.
  2. I updated shap, scikit-learn, and Python to the latest versions, but the error persisted.
  3. I reduced the data size to half the number of people, with no improvement. Changing the number of predictors from 36 to 30 or 20 did not help either.
  4. I considered insufficient memory and watched the usage in Task Manager while running the program under a debugger, but found nothing useful. Since the program runs fine with 37 predictors, a simple lack of memory is unlikely to be the cause, or at least not the only one.
  5. I also checked for coding mistakes with a debugger and print statements. As above, since the program works with 37 predictors, a simple coding mistake seems unlikely.
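On the exception-handling point (item 1): a hard crash inside a C extension, such as a segfault in shap's native code, terminates the interpreter before any `try/except` can run, which would explain why no traceback ever appears. One way to at least capture evidence is to run the risky call in a child process and inspect its exit code. This is a generic sketch with a stand-in script in place of the real shap call:

```python
# A hard crash inside a C extension kills the process before Python's
# exception handling runs, so try/except sees nothing. Running the risky
# call in a child process lets the parent survive and read the exit code.
import subprocess
import sys
import textwrap

# Stand-in for the real computation, e.g.:
#   shap_values = explainer(X_test, check_additivity=False)
# Here the child simply exits cleanly.
script = textwrap.dedent("""
    import sys
    sys.exit(0)
""")

result = subprocess.run([sys.executable, "-c", script])
# POSIX reports a crash as a negative code (-signal number); Windows
# reports an NTSTATUS value such as 0xC0000005 (access violation).
if result.returncode != 0:
    print("child crashed or failed, return code:", result.returncode)
else:
    print("child exited cleanly")
```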