I'm processing a dataset with over 10,000 rows in pandas, and I need to perform groupby and aggregation operations to summarize the data. However, my system has limited memory (8 GB RAM), and the current approach is running into performance issues. I've tried chunking the data and using Dask for parallel processing (the chunked version is sketched below), but the improvements are minimal.
For example, my dataset looks like this:
ID | Value
---|------
1  | 100
2  | 200
1  | 300
and my aggregation code is:
import pandas as pd
data = pd.read_csv('large_file.csv')
result = data.groupby('ID').agg({'Value': 'sum'})
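For reference, the chunked approach I mentioned looks roughly like this (a simplified sketch; the chunk size is just an example value):

import pandas as pd

# Aggregate each chunk separately, then combine the partial sums.
chunks = pd.read_csv('large_file.csv', chunksize=100_000)
partials = [chunk.groupby('ID')['Value'].sum() for chunk in chunks]
result = pd.concat(partials).groupby(level=0).sum()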
Are there alternative approaches or tools I could use to further optimize the performance of these operations?
Comments:
- If your data is just 'id' and 'value', there should be no case where a simple dataset of 10k rows would be limited by 8 GB of memory. You mentioned performance issues: how long does it take for your code to execute on the 10k dataset? (Also, how large is your file? If it's just IDs and values, I find it hard to imagine it's over 15 MB for 10k rows.) – Karan Shishoo
- 10,000 rows is not a large amount of data. What exactly is your performance issue? – user19077881
- 10k rows of data like you present would take less than 100 KB of memory and about 0.01 seconds to produce a result with your current code. Can you elaborate on your performance issue? – JonSG
- Using 10k rows of random data, Polars is about 4x faster than Pandas. Both take a very small fraction of a second. – user19077881
1 Answer
If you have memory issues, don't use pandas for such a simple task. If the task really is just summing the second column over groups of the first, just do:
sums = {}
with open("file.csv", "r") as infile:
    infile.readline()                        # skip the header row (ID,Value)
    for line in infile:                      # read one line at a time
        if not line.strip():                 # ignore blank lines
            continue
        values = line.split(",")             # use the appropriate separator here
        lineid, linevalue = values[0], values[1]
        sums[lineid] = sums.get(lineid, 0) + int(linevalue)

result = {"ID": [], "Value": []}
for i in sorted(sums):
    result["ID"].append(i)
    result["Value"].append(sums[i])
If you really need to, you can then transform the dictionary result
to a pandas df. But you could also just write the result to a new CSV and be done with it.
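For example, here is a minimal sketch of both options, reusing the result and sums dictionaries built above (the output file name sums.csv is only a placeholder):

import csv
import pandas as pd

# Option 1: turn the dict-of-lists into a DataFrame
df = pd.DataFrame(result)

# Option 2: skip pandas entirely and write the sums straight to a new CSV
with open("sums.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["ID", "Value"])
    for lineid, total in sorted(sums.items()):
        writer.writerow([lineid, total])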
This should work for very large files, until the sums
dictionary causes a memory overflow.