python - SHAP plot with categorical columns - Stack Overflow

Before one-hot encoding, the input data consists of 2 categorical columns (category1, category2). category1 takes one of the values A, B, C and category2 takes one of X, Y. After one-hot encoding, the input data has 5 columns (A, B, C, X, Y). The fit function works well.

However, the problem lies in the SHAP summary plot. I want the plot to show the 2 original input columns (category1, category2), but it actually shows the 5 encoded columns (A, B, C, X, Y) (picture below).

How can I do that? Here is the working code. I guess there may be some additional parameter for TreeExplainer that specifies the categorical columns, so that the permutation cost goes down, but I have no idea.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
import shap

# example data (2 input columns, 1 output)
data = {
    'category1': ['A', 'B', 'C', 'A', 'B'],
    'category2': ['X', 'Y', 'X', 'Y', 'X'],
    'target': [0, 1, 0, 1, 0]
}
df = pd.DataFrame(data)

# category columns
categorical_features = ['category1', 'category2']

# OneHotEncoder 
transformer = ColumnTransformer(
    transformers=[
        ('encoder', OneHotEncoder(sparse_output=False), categorical_features)
    ],
    remainder='passthrough'
)

# data preparation
X = df.drop('target', axis=1)
y = df['target']

# encoding
# after encoding: 3 + 2 = 5 one-hot input columns (A, B, C, X, Y)
X_encoded = transformer.fit_transform(X)

# split data
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42
)

# train model
model = XGBClassifier()
model.fit(X_train, y_train)

#-------------------
# Explain the model's predictions using SHAP
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Visualize SHAP values with feature names
feature_names = transformer.get_feature_names_out(categorical_features)
shap.summary_plot(shap_values, X_test, feature_names=feature_names)
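
One common workaround, sketched here rather than offered as a built-in SHAP feature, is to keep the one-hot model and sum the SHAP values of the encoded columns that belong to the same original column. Because SHAP values are additive, the grouped values still add up to the same total contribution per row. The helper name `aggregate_shap_by_feature` and the toy numbers are hypothetical. The sketch assumes `shap_values` is a 2-D array of shape (n_samples, n_encoded_features), and that the encoded names follow the `encoder__<column>_<category>` pattern that `get_feature_names_out` produces for the transformer above.

```python
import numpy as np


def aggregate_shap_by_feature(shap_values, encoded_names, original_features,
                              prefix="encoder__"):
    """Sum one-hot SHAP columns back into one column per original feature.

    Assumes each encoded name starts with f"{prefix}{feature}_" and that no
    original feature name is itself a prefix of another one plus "_".
    """
    shap_values = np.asarray(shap_values, dtype=float)
    grouped = np.zeros((shap_values.shape[0], len(original_features)))
    for j, feature in enumerate(original_features):
        stem = f"{prefix}{feature}_"
        # Indices of the encoded columns generated from this original column
        cols = [i for i, name in enumerate(encoded_names)
                if name.startswith(stem)]
        grouped[:, j] = shap_values[:, cols].sum(axis=1)
    return grouped


# Toy illustration with made-up SHAP values (2 rows, 5 encoded columns)
encoded_names = [
    "encoder__category1_A", "encoder__category1_B", "encoder__category1_C",
    "encoder__category2_X", "encoder__category2_Y",
]
toy_shap = np.array([
    [0.1, -0.2, 0.0, 0.3, -0.1],
    [0.0, 0.5, -0.1, -0.2, 0.2],
])
grouped = aggregate_shap_by_feature(
    toy_shap, encoded_names, ["category1", "category2"]
)
print(grouped.shape)  # one column per original feature
```

With the real model, the same call would take `shap_values`, `feature_names`, and `categorical_features` from the code above, and the result could then be passed to `shap.summary_plot(grouped, feature_names=categorical_features, plot_type="bar")`. The bar plot is suggested because the grouped values no longer correspond to a single numeric feature value that a beeswarm plot could use for coloring.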