I'm processing a dataset with over 10,000 rows in pandas, and I need to perform groupby and aggregation operations to summarize the data. However, my system has limited memory (8 GB RAM), and the current approach is running into performance issues. I've tried chunking the data and using Dask for parallel processing (the chunked version is sketched below), but the improvements are minimal.
For example, my dataset looks like this:
ID | Value
---|------
1  | 100
2  | 200
1  | 300
and my aggregation code is:
import pandas as pd
data = pd.read_csv('large_file.csv')
result = data.groupby('ID').agg({'Value': 'sum'})
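For reference, the chunked approach I mentioned looks roughly like this (a simplified sketch; the chunk size is just an example value):

import pandas as pd

# Aggregate each chunk separately, then combine the partial sums.
chunks = pd.read_csv('large_file.csv', chunksize=100_000)
partials = [chunk.groupby('ID')['Value'].sum() for chunk in chunks]
result = pd.concat(partials).groupby(level=0).sum()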
Are there alternative approaches or tools I could use to further optimize the performance of these operations?
Comments:
- If your data is just 'id' and 'value', there should be no case where a simple dataset of 10k rows would be limited by 8 GB of memory. You mentioned performance issues: how long does it take for your code to execute on the 10k dataset? (Also, how large is your file? If it's just IDs and values, I find it hard to imagine it's over 15 MB for 10k rows.) – Karan Shishoo
- 10,000 rows is not a large amount of data. What exactly is your performance issue? – user19077881
- 10k rows of data like you present would take less than 100 KB of memory and about 0.01 seconds to produce a result with your current code. Can you elaborate on your performance issue? – JonSG
- Using 10k rows of random data, Polars is about 4x faster than Pandas. Both take a very small fraction of a second. – user19077881
1 Answer
If you have memory issues, don't use pandas for such a simple task. If the task really is just summing the second column over groups of the first, just do:
sums = {}
with open("file.csv", "r") as infile:
    infile.readline()                        # skip the header row (ID,Value)
    for line in infile:                      # read one line at a time
        if not line.strip():                 # ignore blank lines
            continue
        values = line.split(",")             # use the appropriate separator here
        lineid, linevalue = values[0], values[1]
        sums[lineid] = sums.get(lineid, 0) + int(linevalue)

result = {"ID": [], "Value": []}
for i in sorted(sums):
    result["ID"].append(i)
    result["Value"].append(sums[i])
If you really need to, you can then transform the dictionary result
to a pandas df. But you could also just write the result to a new CSV and be done with it.
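For example, here is a minimal sketch of both options, reusing the result and sums dictionaries built above (the output file name sums.csv is only a placeholder):

import csv
import pandas as pd

# Option 1: turn the dict-of-lists into a DataFrame
df = pd.DataFrame(result)

# Option 2: skip pandas entirely and write the sums straight to a new CSV
with open("sums.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["ID", "Value"])
    for lineid, total in sorted(sums.items()):
        writer.writerow([lineid, total])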
This should work for very large files, until the sums
dictionary causes a memory overflow.