I'm trying to concatenate TF-IDF features with categorical features. When I stack them, the categorical columns are converted to float. CatBoost doesn't accept float values for categorical features, so fitting on the resulting sparse matrix fails because the columns are no longer recognized as categorical.
Is there a way around this? My code is below for reference:
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from scipy.sparse import hstack, csr_matrix
text_data = [
    "I love machine learning and data science",
    "Deep learning is a subset of machine learning",
    "Natural language processing is amazing",
    "AI is transforming the world",
    "Big data and AI are revolutionizing industries"
]
categorical_data = {
    "Category": ["Tech", "Tech", "NLP", "AI", "Big Data"],
    "Region": ["US", "Europe", "Asia", "US", "Europe"]
}
y = np.array([0, 1, 0, 1, 1])
df_cat = pd.DataFrame(categorical_data)
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(text_data)
df_cat_encoded = df_cat.apply(LabelEncoder().fit_transform)
X_categorical = csr_matrix(df_cat_encoded.values)
X_combined = hstack([X_tfidf, X_categorical])
model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=5, verbose=0)
model.fit(X_combined, y, cat_features=[X_tfidf.shape[1], X_tfidf.shape[1] + 1])
predictions = model.predict(X_combined)
print(predictions)
Error:
CatBoostError: 'data' is scipy.sparse.spmatrix of floating point numerical type,
it means no categorical features, but 'cat_features' parameter specifies nonzero
number of categorical features
Asked Mar 24 at 13:53 by Perinban Parameshwaran; edited Mar 24 at 14:09 by desertnaut.
1 Answer
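SciPy sparse matrices have a single dtype shared by every element, so stacking the float TF-IDF matrix with the integer-encoded categories upcasts everything to float. A pandas DataFrame can hold a different dtype per column, so one solution is to build the combined data as a DataFrame instead: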
df_tfidf = pd.DataFrame.sparse.from_spmatrix(
    X_tfidf,
    columns=vectorizer.get_feature_names_out(),
)
X_combined = pd.concat([df_tfidf, df_cat_encoded], axis=1)
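To see the dtype behaviour in isolation, here is a minimal sketch that leaves CatBoost out entirely. The small float matrix stands in for the TF-IDF output, and the integer codes and the column names `tok_a` and `tok_b` are made up for illustration; none of them come from the question's data.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack

# Stand-in for the TF-IDF output: a small float64 sparse matrix (made-up values).
X_float = csr_matrix(np.array([[0.5, 0.0], [0.0, 0.7], [0.3, 0.0]]))

# Stand-in for the label-encoded categorical columns (made-up integer codes).
df_codes = pd.DataFrame({"Category": [2, 2, 1], "Region": [2, 1, 0]}, dtype=np.int64)
X_codes = csr_matrix(df_codes.values)

# A SciPy sparse matrix has one dtype, so hstack upcasts the integer codes to float.
stacked = hstack([X_float, X_codes])
print(stacked.dtype)

# A DataFrame keeps a dtype per column: the token columns stay sparse float,
# and the code columns stay integer.
df_float = pd.DataFrame.sparse.from_spmatrix(X_float, columns=["tok_a", "tok_b"])
combined = pd.concat([df_float, df_codes], axis=1)
print(combined.dtypes)
```

With the columns kept as integers, the categorical features can presumably be passed to `fit` by column name, e.g. `cat_features=list(df_cat_encoded.columns)`, rather than by position. That last step is not run here, so treat it as an assumption to check against your CatBoost version.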