I am learning Dask to make my Python projects more efficient and scalable. To understand its performance better, I wrote a script comparing the computation time of Pandas and Dask when calculating the mean of a column in a large dataset. Here's my code:
import pandas as pd
import dask.dataframe as dd
import time
from memory_profiler import memory_usage
filename = "large_dataset_3.csv"
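# Pandas loads the entire CSV into memory eagerly; Dask only records a plan, split into ~75 MB blocks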
df_pd = pd.read_csv(filename)
df_dask = dd.read_csv(filename, blocksize=75e6)
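# Time the Pandas mean (the data is already fully loaded above)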
start = time.time()
mean_pd = df_pd["points"].mean()
stop = time.time()
print(f"Pandas Mean Computation Time {stop - start:.5f} seconds")
start = time.time()
mean_dask = df_dask["points"].mean().compute(num_workers=4)
stop = time.time()
print(f"Dask Mean Computation Time {stop - start:.5f} seconds")
When I run this script, I find that Pandas computes the mean in about 0.02 seconds, while Dask takes more than 4.5 seconds. This result is surprising because I expected Dask to be faster due to its parallel processing capabilities.
For context:
The dataset (large_dataset_3.csv) contains 100 million rows, with a total size of 292.4 MB.
My system specs are:
Processor: Intel® Core™ i5-8365U × 8 (4 cores, 8 threads)
RAM: 16 GB
My Questions:
Why is Dask slower than Pandas in this scenario? Are there optimizations or configurations I can apply to make Dask perform better?
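For example, would something along these lines be the right direction, or is the dataset simply too small for Dask's overhead to pay off? (The worker and partition numbers below are just guesses on my part.)

from dask.distributed import Client

# Start a local cluster so the computation runs across separate worker processes
client = Client(n_workers=4, threads_per_worker=2)

df_dask = dd.read_csv(filename, blocksize="64MB")
df_dask = df_dask.persist()  # load the partitions into worker memory once
mean_dask = df_dask["points"].mean().compute()
client.close()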