python - How to efficiently perform pandas groupby and aggregation on large datasets with memory constraints? - Stack Overflow


I'm processing a dataset with over 10 thousand rows in pandas, and I need to perform groupby and aggregation operations to summarize the data. However, my system has limited memory (8GB RAM), and the current approach is running into performance issues. I've tried chunking the data and using Dask for parallel processing, but the improvements are minimal. For example, my dataset looks like this:

ID  | Value  
----|------  
1   | 100  
2   | 200  
1   | 300  

and my aggregation code:

import pandas as pd

data = pd.read_csv('large_file.csv')
result = data.groupby('ID').agg({'Value': 'sum'})

Are there alternative approaches or tools I could use to further optimize the performance of these operations?
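
For context, the kind of chunked aggregation I'm referring to looks roughly like this (a minimal sketch; the chunk size of 100,000 rows is arbitrary):

import pandas as pd

# Stream the file in chunks so only one chunk plus the partial sums are in memory at a time.
partial_sums = []
for chunk in pd.read_csv('large_file.csv', chunksize=100_000):
    partial_sums.append(chunk.groupby('ID')['Value'].sum())

# Combine the per-chunk sums and aggregate again to get the final total per ID.
result = pd.concat(partial_sums).groupby(level=0).sum()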


asked Mar 24 at 5:51 by ZOMBIE_JERKY
  • If your data is JUST 'id' and 'value', there should be no case where a simple dataset of 10k rows is limited by 8GB of memory. You mentioned performance issues; how long does it take for your code to execute on the 10k dataset? (Also, how large is your file? If it's just IDs and values, I find it hard to imagine it's over 15 MB for 10k rows.) – Karan Shishoo Commented Mar 24 at 9:20
  • 10,000 rows is not a large amount of data. What exactly is your performance issue? – user19077881 Commented Mar 24 at 9:25
  • 10k rows of data like you present would take less than 100 KB of memory and about 0.01 seconds to produce a result with your current code. Can you elaborate on your performance issue? – JonSG Commented Mar 24 at 14:21
  • Using 10k rows of random data, Polars is about 4x faster than Pandas. Both take a very small fraction of a second. – user19077881 Commented Mar 24 at 16:50
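As one of the comments above notes, Polars can run the same aggregation noticeably faster than pandas, and its lazy CSV scanner avoids loading the whole file up front. A minimal sketch, assuming a recent Polars version (where the method is named group_by):

import polars as pl

# Lazily scan the CSV; only the aggregated result is materialized in memory.
result = (
    pl.scan_csv("large_file.csv")
      .group_by("ID")
      .agg(pl.col("Value").sum())
      .collect()
)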

1 Answer


If you have memory issues, don't use pandas for such a simple task. If the task really is just summing the second column grouped by the first, plain Python is enough:

sums = {}
with open("file.csv", "r") as infile:
    infile.readline()  # skip the header row (drop this line if the file has no header)
    for line in infile:
        if not line.strip():
            continue  # ignore blank lines
        values = line.strip().split(",")  # use the appropriate separator here
        lineid, linevalue = values[0], values[1]
        try:
            sums[lineid] += int(linevalue)
        except KeyError:
            sums[lineid] = int(linevalue)

# Collect the per-ID totals, sorted by ID, into a column-oriented dict.
result = {"ID": [], "Value": []}
for i in sorted(sums):
    result["ID"].append(i)
    result["Value"].append(sums[i])

If you really need to, you can then transform the dictionary result into a pandas DataFrame. But you could also just write the result to a new CSV and be done with it.

This should work for very large files, up to the point where the sums dictionary itself no longer fits in memory.
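
If you do want the result back in pandas, or written straight out to disk, a minimal follow-up sketch might look like this (the output file name sums.csv is an assumption):

import csv
import pandas as pd

# Option 1: build a DataFrame from the column-oriented result dictionary.
df = pd.DataFrame(result)

# Option 2: skip pandas entirely and write the totals to a new CSV.
with open("sums.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["ID", "Value"])
    writer.writerows(zip(result["ID"], result["Value"]))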
