python - Concatenating TF-IDF Data and Categorical Data for CatBoost Model

I've been trying to concatenate TF-IDF features with categorical features. However, when the two are stacked into one sparse matrix, the label-encoded categorical columns are converted to float. CatBoost does not accept floating-point values for categorical features, so fitting on the sparse matrix raises an error: the columns are no longer recognized as categorical.

Is there a solution to this issue? Please find my code below for reference:

import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from scipy.sparse import hstack, csr_matrix

text_data = [
    "I love machine learning and data science",
    "Deep learning is a subset of machine learning",
    "Natural language processing is amazing",
    "AI is transforming the world",
    "Big data and AI are revolutionizing industries"
]

categorical_data = {
    "Category": ["Tech", "Tech", "NLP", "AI", "Big Data"],
    "Region": ["US", "Europe", "Asia", "US", "Europe"]
}

y = np.array([0, 1, 0, 1, 1])

df_cat = pd.DataFrame(categorical_data)

vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(text_data)

df_cat_encoded = df_cat.apply(LabelEncoder().fit_transform)

X_categorical = csr_matrix(df_cat_encoded.values)

X_combined = hstack([X_tfidf, X_categorical])

model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=5, verbose=0)

model.fit(X_combined, y, cat_features=[X_tfidf.shape[1], X_tfidf.shape[1] + 1])

predictions = model.predict(X_combined)

print(predictions)

Error:

CatBoostError: 'data' is scipy.sparse.spmatrix of floating point numerical type, 
it means no categorical features, but 'cat_features' parameter specifies nonzero 
number of categorical features

Asked Mar 24 at 13:53 by Perinban Parameshwaran; edited Mar 24 at 14:09 by desertnaut.

1 Answer

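A SciPy sparse matrix has a single dtype shared by all of its entries, so hstack upcasts the integer-coded categorical columns to float to match the TF-IDF block. A pandas DataFrame can hold a separate dtype per column, so one solution is to convert the TF-IDF matrix to a sparse DataFrame and concatenate it with the encoded categorical columns: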

df_tfidf = pd.DataFrame.sparse.from_spmatrix(
    X_tfidf,
    columns=vectorizer.get_feature_names_out(),
)

X_combined = pd.concat([df_tfidf, df_cat_encoded], axis=1)
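
To see the difference in isolation, here is a minimal sketch that uses small stand-in data rather than the question's vectorizer output. The values and the column names t0, t1, t2 are made up for illustration; only the dtype behavior matters.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack

# Stand-ins for the question's data: a float TF-IDF-like block and
# integer category codes (illustrative values, not from the question).
X_tfidf = csr_matrix(np.array([[0.5, 0.0, 0.3],
                               [0.0, 0.7, 0.0]]))
df_cat_encoded = pd.DataFrame({"Category": [1, 0], "Region": [0, 1]})

# Stacking sparse blocks yields one matrix with one dtype, so the
# integer codes are upcast to match the float block.
stacked = hstack([X_tfidf, csr_matrix(df_cat_encoded.values)])
print(stacked.dtype)

# A DataFrame keeps a dtype per column, so the codes stay integers.
df_tfidf = pd.DataFrame.sparse.from_spmatrix(
    X_tfidf,
    columns=["t0", "t1", "t2"],  # hypothetical feature names
)
X_combined = pd.concat([df_tfidf, df_cat_encoded], axis=1)
cat_dtypes = X_combined[["Category", "Region"]].dtypes
print(cat_dtypes)
```

With the DataFrame version, the categorical columns can be referred to by name, so model.fit(X_combined, y, cat_features=["Category", "Region"]) should work in place of the positional indices, assuming the installed CatBoost version accepts DataFrames with sparse columns.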
