I'm using k-means for the first time in a project. My dataset has more than 400,000 rows and 11 columns, and I run k-means for k = 3, 5, 7, 9, and 10. It has been running for more than 65 minutes and there is still no output. Is that normal? It's my first time, so I'm not sure what to expect.
I'm using Python in Visual Studio.
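import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# x_pca is the PCA-reduced dataset, computed earlier (not shown)
sse = []
silhouette_scores = []
k_values = [3, 5, 7, 9]
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=1, init='k-means++')
    kmeans.fit(x_pca)
    sse.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(x_pca, kmeans.labels_))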
# the elbow method
plt.figure(figsize=(10, 6))
plt.plot(k_values, sse, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Sum of Squared Errors (SSE)')
plt.show()
# silhouette scores
plt.figure(figsize=(10, 6))
plt.plot(k_values, silhouette_scores, marker='o')
plt.title('Silhouette Score for Optimal k')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.show()
asked Nov 20, 2024 at 20:46 by Joud (edited Nov 20, 2024 at 21:47)
1 Answer
Analysis
It's not your K-means that is slow, it's silhouette_score.
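One way to check this, and to keep the score affordable, is the `sample_size` parameter of `silhouette_score`, which computes the score on a random subset of rows. The following is a minimal sketch, not the asker's actual pipeline: `x_demo` is a synthetic stand-in for `x_pca`, and the row count and sample size are arbitrary assumptions chosen so the example runs quickly.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for x_pca (assumption: the real data is not available here).
rng = np.random.default_rng(42)
x_demo = rng.normal(size=(10_000, 11))

k_values = [3, 5, 7, 9]
sse = []
silhouette_scores = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=1, init="k-means++")
    kmeans.fit(x_demo)
    sse.append(kmeans.inertia_)
    # sample_size scores a random subset of rows, so the pairwise-distance
    # work grows with the sample rather than with the full dataset.
    score = silhouette_score(
        x_demo, kmeans.labels_, sample_size=2_000, random_state=42
    )
    silhouette_scores.append(score)

print(sse)
print(silhouette_scores)
```

The sampled score is an estimate, so it can vary a little with `random_state`; timing the `fit` call and the `silhouette_score` call separately on the real data should show which of the two dominates.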
The time complexity of K-means is …

Comment: … np.random, it should definitely not take that long. – Teemu Risikko, Nov 20, 2024 at 21:15
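For a rough sense of scale, here is a back-of-the-envelope count under the usual textbook cost models, which are assumptions here and not taken from the answer: Lloyd-style k-means costs about n·k·d per iteration, while an exact silhouette score needs a distance for every pair of rows, about n²·d/2. The iteration count below is hypothetical.

```python
# Dataset shape from the question.
n, d = 400_000, 11

# Hypothetical values: largest k mentioned, and an assumed iteration count.
k, iterations = 10, 100

# Assumed cost model for k-means: one distance per (row, centroid) per iteration.
kmeans_ops = n * k * d * iterations

# Assumed cost model for an exact silhouette score: one distance per distinct pair.
silhouette_pairs = n * (n - 1) // 2
silhouette_ops = silhouette_pairs * d

ratio = silhouette_ops / kmeans_ops

print(f"k-means operations:    {kmeans_ops:.3e}")
print(f"silhouette row pairs:  {silhouette_pairs:.3e}")
print(f"silhouette / k-means:  {ratio:.1f}x")
```

Under these assumptions the silhouette step does a couple of hundred times more work than a single k-means fit, and it is repeated for every k in the loop.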