
Why is Dask slower than Pandas in computing the mean of a large dataset, and how can I improve performance?


I am learning Dask to make my Python projects more efficient and scalable. To understand its performance better, I wrote a script comparing the computation time of Pandas and Dask when calculating the mean of a column in a large dataset. Here's my code:

import pandas as pd
import dask.dataframe as dd
import time

filename = "large_dataset_3.csv"

# Pandas reads the file eagerly; Dask only builds a lazy task graph here.
df_pd = pd.read_csv(filename)
df_dask = dd.read_csv(filename, blocksize=75e6)

# Time the Pandas mean (the data is already in memory).
start = time.time()
mean_pd = df_pd["points"].mean()
stop = time.time()
print(f"Pandas Mean Computation Time {stop - start:.5f} seconds")

# Time the Dask mean; .compute() triggers the actual work.
start = time.time()
mean_dask = df_dask["points"].mean().compute(num_workers=4)
stop = time.time()
print(f"Dask Mean Computation Time {stop - start:.5f} seconds")

When I run this script, I find that Pandas computes the mean in about 0.02 seconds, while Dask takes more than 4.5 seconds. This result is surprising because I expected Dask to be faster due to its parallel processing capabilities.
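One confound worth noting (I am not certain it explains the whole gap): dd.read_csv is lazy, so the timed Dask section also reads and parses the CSV from disk, while the Pandas section operates on data already in memory. A minimal sketch of a fairer timing, assuming the data fits in RAM, persists the partitions before timing:

import dask.dataframe as dd
import time

df_dask = dd.read_csv("large_dataset_3.csv", blocksize="75MB")
# persist() materializes the partitions in memory, so the timed section
# below measures only the mean computation, not CSV reading/parsing.
df_dask = df_dask.persist()

start = time.time()
mean_dask = df_dask["points"].mean().compute(num_workers=4)
stop = time.time()
print(f"Dask Mean (data pre-loaded): {stop - start:.5f} seconds")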

For context:

The dataset (large_dataset_3.csv) contains 100 million rows, with a total size of 292.4 MB.

My system specs are:

Processor: Intel® Core™ i5-8365U (4 cores, 8 threads)

RAM: 16 GB

My Questions:

1. Why is Dask slower than Pandas in this scenario?
2. Are there optimizations or configurations I can apply to make Dask perform better? (See the sketch below for the kind of change I mean.)
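For concreteness, this is the sort of change I am asking about. It is a sketch, not something I have verified to help; the Parquet output path is hypothetical, and it assumes pyarrow (or fastparquet) is installed:

import dask.dataframe as dd

# Sketch: one-time conversion of the CSV to Parquet
# (the output path is hypothetical).
df = dd.read_csv("large_dataset_3.csv", blocksize="75MB")
df.to_parquet("large_dataset_3.parquet")

# Parquet is columnar, so Dask can read just the "points" column
# and skip parsing everything else.
points = dd.read_parquet("large_dataset_3.parquet", columns=["points"])
print(points["points"].mean().compute(num_workers=4))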
