python - K-Means taking a long time - Stack Overflow

I'm using k-means for the first time in a project. My dataset has more than 400,000 rows and 11 columns, and I ran k-means for k = 3, 5, 7, 9, and 10. After more than 65 minutes there is still no output. Is that normal? I'm not sure what to expect.

I'm using Python in Visual Studio.

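import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# x_pca is the PCA-reduced feature matrix (its construction is not shown here)
sse = []
silhouette_scores = []
k_values = [3, 5, 7, 9]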

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=1, init='k-means++')
    kmeans.fit(x_pca)
    sse.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(x_pca, kmeans.labels_))

# the elbow method
plt.figure(figsize=(10, 6))
plt.plot(k_values, sse, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Sum of Squared Errors (SSE)')
plt.show()

# silhouette scores
plt.figure(figsize=(10, 6))
plt.plot(k_values, silhouette_scores, marker='o')
plt.title('Silhouette Score for Optimal k')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.show()

asked Nov 20, 2024 at 20:46 by Joud (edited Nov 20, 2024 at 21:47)
  • Hello and welcome to Stack Overflow! Please take your time to read through How To Ask and edit your question to include a Minimal Reproducible Example of your code. Otherwise it's hard to say anything about the performance problems. – Teemu Risikko Commented Nov 20, 2024 at 21:05
  • Apart from that, when using pandas + sklearn, a clustering like that takes a few seconds at most for fake data. Even with more complex datatypes than some fake data generated with np.random, it should definitely not take that long. – Teemu Risikko Commented Nov 20, 2024 at 21:15

1 Answer


Analysis

It's not your K-means that is slow; it's silhouette_score.
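To illustrate the point: silhouette_score works from pairwise distances between samples, so its cost grows roughly quadratically with the number of rows, and 400,000 rows means on the order of 1.6 × 10^11 pairs. One possible workaround (a sketch only, not necessarily what this answer goes on to recommend) is to score a random subsample through the function's sample_size parameter. The array x_demo and the sizes below are made up for illustration, because the real x_pca is not shown in the question.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for x_pca: random data with 11 columns.
# (Assumption: the real data is not available in the question.)
rng = np.random.default_rng(42)
x_demo = rng.normal(size=(20_000, 11))

# Same KMeans settings as in the question, for a single k.
kmeans = KMeans(n_clusters=3, random_state=42, n_init=1, init='k-means++')
labels = kmeans.fit_predict(x_demo)

# Score a random subsample of 2,000 rows instead of every row.
# sample_size and random_state are parameters of silhouette_score;
# the subsample size itself is an arbitrary illustrative choice.
score = silhouette_score(x_demo, labels, sample_size=2_000, random_state=42)
print(f"k=3, approximate silhouette score: {score:.3f}")
```

The result is an estimate rather than the exact score, so it can vary a little with the random_state used for sampling. Inertia (the SSE used for the elbow plot) is still computed on the full data by KMeans itself.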

The time complexity of K-means is
