
python - Concatenating TF-IDF Data and Categorical Data for CatBoost Model - Stack Overflow


I've been trying to concatenate TF-IDF data with categorical data. However, when concatenating, the categorical data is automatically converted to float. Since CatBoost doesn't accept float values for categorical features, this causes an error: once everything is in a sparse float matrix, the categorical columns are no longer recognized as categorical.

Is there a solution to this issue? Please find my code below for reference:

import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from scipy.sparse import hstack, csr_matrix

text_data = [
    "I love machine learning and data science",
    "Deep learning is a subset of machine learning",
    "Natural language processing is amazing",
    "AI is transforming the world",
    "Big data and AI are revolutionizing industries"
]

categorical_data = {
    "Category": ["Tech", "Tech", "NLP", "AI", "Big Data"],
    "Region": ["US", "Europe", "Asia", "US", "Europe"]
}

y = np.array([0, 1, 0, 1, 1])

df_cat = pd.DataFrame(categorical_data)

vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(text_data)

df_cat_encoded = df_cat.apply(LabelEncoder().fit_transform)

X_categorical = csr_matrix(df_cat_encoded.values)

X_combined = hstack([X_tfidf, X_categorical])

model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=5, verbose=0)

model.fit(X_combined, y, cat_features=[X_tfidf.shape[1], X_tfidf.shape[1] + 1])

predictions = model.predict(X_combined)

print(predictions)

Error:

CatBoostError: 'data' is scipy.sparse.spmatrix of floating point numerical type, 
it means no categorical features, but 'cat_features' parameter specifies nonzero 
number of categorical features
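
As a quick illustration of the coercion described above (a minimal sketch with made-up values, not taken from the original data): scipy.sparse.hstack promotes all blocks to a single common dtype, so stacking float TF-IDF values with integer-encoded categories yields a float64 matrix.

import numpy as np
from scipy.sparse import hstack, csr_matrix

tfidf_block = csr_matrix(np.array([[0.5, 0.0]]))            # float64, like TF-IDF output
cat_block = csr_matrix(np.array([[1, 2]], dtype=np.int64))  # int64, like label-encoded categories

# both blocks are promoted to one common dtype
print(hstack([tfidf_block, cat_block]).dtype)  # float64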


1 Answer

Scipy sparse matrices have a single dtype for all of their entries, hence the conversion to float when you stack. Pandas DataFrames, by contrast, can hold a separate dtype per column, so converting to a DataFrame is one solution:

df_tfidf = pd.DataFrame.sparse.from_spmatrix(
    X_tfidf,
    columns=vectorizer.get_feature_names_out(),
)

X_combined = pd.concat([df_tfidf, df_cat_encoded], axis=1)
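
With everything in one DataFrame, the model can then be fit by passing the categorical columns to cat_features by column name instead of by position. A minimal sketch continuing from the snippet above, reusing the hyperparameters from the question (depending on the CatBoost version, the sparse-backed TF-IDF columns may be densified internally when passed to fit):

model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=5, verbose=0)

# cat_features can reference DataFrame columns by name, so the integer-encoded
# Category and Region columns keep their dtype and are treated as categorical
model.fit(X_combined, y, cat_features=["Category", "Region"])

predictions = model.predict(X_combined)
print(predictions)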