A script generates 50,000 dataframes of the structure below and saves them one by one to the local disk. To improve efficiency, I changed the output format from Excel to Parquet; however, it does not seem to run any faster.
The 50,000 dataframes are then looped over to filter the rows where the Meet column is 4 or 5, and those rows are saved to a final txt file.
What is a better solution here (to end up with a final file containing only the rows with 4 or 5 in Meet)?
I'm thinking that when generating the 50,000 dataframes, the script could filter the rows (4 or 5 in Meet) instead of keeping all rows in each smaller dataframe, and instead of writing 50,000 dataframes it could write directly to the final txt file. That is, skip the step of writing each small dataframe as a small file to the local disk.
The number of rows may reach into the millions. I'm not sure whether a normal laptop can handle that (Win11, 16 GB RAM, no internet connection).
Dict DT Length Meet
0 {'key_0': 45, 'key_1': 67} 2023-10-15 14:32:10 15 5
1 {'key_0': 12, 'key_1': 34} 2023-10-12 09:15:45 19 3
2 {'key_0': 56, 'key_1': 89} 2023-10-20 11:45:30 13 7
3 {'key_0': 23, 'key_1': 45} 2023-10-05 08:20:00 17 4
4 {'key_0': 78, 'key_1': 12} 2023-10-10 16:05:55 10 6
Due to the length of the code (1,315 lines) and privacy, I'm sorry that I cannot paste it here. I have tried writing to one final dataframe directly, but the power was accidentally lost once and everything had to be rerun.
big_df = []
......
- - - lines to generate df_small - - -
df_small = df_small[df_small['Meet'].isin([4, 5])]
big_df.append(df_small)  # append the filtered small dataframe
writing_df = pd.concat(big_df, ignore_index=True)
writing_df.to_excel('final.xlsx', index=False)
4 Answers
Since you have a working process to generate the 50k dataframes, consider adding one more step to the existing looping logic to filter and save data to a growing CSV file. Specifically, turn this filtering and saving process into a defined function that appends to the CSV file. Then, at the end of your main looping logic, run each dataframe through it. This avoids building large objects with list.append, pd.concat, or even assignment using =.
def filter_save_data(df, csv_file):
    # append only the rows where Meet is 4 or 5 to the growing CSV file
    df[df['Meet'].isin([4, 5])].to_csv(
        path_or_buf=csv_file, header=False, index=False, mode="a"
    )
for ... :
    # ... looping logic to generate data ...
    filter_save_data(df, "output.csv")
    # ... other processing with data ...
Be sure to delete any previous CSV file and initialize it with the column headers before the looping logic.
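For example, a minimal sketch of that initialization step, assuming the output file is named output.csv and the columns match the sample data above:
import os
import pandas as pd

csv_file = "output.csv"

# start fresh: remove any output from a previous run, then write the header row once
if os.path.exists(csv_file):
    os.remove(csv_file)
pd.DataFrame(columns=["Dict", "DT", "Length", "Meet"]).to_csv(csv_file, index=False)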
In lieu of appending to a CSV, consider using DataFrame.to_sql to append to a table within a file-level database such as SQLite, whose DB-API driver (sqlite3) is part of the Python standard library:
from sqlalchemy import create_engine
...
def filter_save_data(df):
    # build the engine for the SQLite file (keep the line matching your OS)
    engine = create_engine('sqlite:////home/path/to/mydatabase.db')   # UNIX
    engine = create_engine(r'sqlite:///C:\path\to\mydatabase.db')     # WINDOWS

    # append only the rows where Meet is 4 or 5 to the database table
    df[df['Meet'].isin([4, 5])].to_sql(
        name="mytable",
        con=engine,
        index=False,
        if_exists="append"
    )

for ... :
    # ... looping logic to generate data ...
    filter_save_data(df)
    # ... other processing with data ...
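As a small refinement (my suggestion, not part of the snippet above), the engine can be created once outside the function and passed in, so each call does not rebuild the connection:
# sketch: build the engine once and reuse it for every append
engine = create_engine(r'sqlite:///C:\path\to\mydatabase.db')

def filter_save_data(df, engine):
    df[df['Meet'].isin([4, 5])].to_sql(
        name="mytable", con=engine, index=False, if_exists="append"
    )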
To later read back into pandas:
import sqlite3
import pandas as pd

# pd.read_sql_table requires a SQLAlchemy connectable, so with a plain
# sqlite3 connection use pd.read_sql with a query instead
conn = sqlite3.connect("/path/to/mydatabase.db")
final_df = pd.read_sql("SELECT * FROM mytable", conn)
Conceptually, I would evaluate the rows as you create them and only persist the ones you want. Of course, if you need the 50k files for some other reason, that would invalidate this.
I would also recommend you switch from Excel files to CSV files, or really anything you can stream. I would also remove pandas from this unless there is some compelling reason to keep it, such as grouping.
Conceptually, here is a script that can generate an almost arbitrarily large result containing only Meet values of 4 and 5.
import random
import datetime
import csv

def generate_random_row():
    return {
        "Dict": {
            'key_0': random.randint(1, 99),
            'key_1': random.randint(1, 99)
        },
        "DT": datetime.datetime.strftime(datetime.datetime.fromtimestamp(random.randint(1672531200, 1704067199)), "%Y-%m-%d %H:%M:%S"),
        "Length": random.randint(10, 20),
        "Meet": random.randint(1, 10)
    }

rows_to_generate = 1_000_000

with open("final.csv", "w", newline="") as file_out:
    writer = csv.DictWriter(file_out, fieldnames=generate_random_row().keys())
    writer.writeheader()
    for index in range(rows_to_generate):
        row = generate_random_row()
        if row["Meet"] in [4, 5]:
            writer.writerow(row)
        if not index % 1_000:
            print(f"processed: {index} candidate rows", end="\r", flush=True)
    print(f"processed: {index} candidate rows")
If you must use the 50k-files strategy and can write them as CSV files (which Excel can still read), let me know and I'll make a slight modification to this and use glob.glob() to open and stream-process them.
Here is a version that processes all CSV files matching the pattern "starting_*.csv". Note that glob.glob() can also find nested files of this pattern (with a recursive ** pattern), but all the test files I created happen to be in the folder I am testing from.
import glob

with open("final.csv", "w", newline="") as file_out:
    headers_written = False
    for starting_file in glob.glob("starting_*.csv"):
        print(f"Processing file: {starting_file}", end="\r", flush=True)
        with open(starting_file, "r") as file_in:
            headers = next(file_in)
            if not headers_written:
                file_out.write(headers)
                headers_written = True
            for row in file_in:
                columns = row.split(",")
                # the Meet value is the last column; keep only rows of 4 or 5
                if columns[-1].strip() in ("4", "5"):
                    file_out.write(row)
    print()
Without looking at the code, this is what I recommend.
Put the frames in a generator.
If you use a function to generate the data, use yield instead of return. That means the data can be consumed on the fly without holding everything in memory.
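A minimal sketch of that idea, assuming a hypothetical generate_small_df() that builds one small dataframe per iteration (the Meet filter values come from the question; everything else is an assumption):
import pandas as pd

def frames():
    for i in range(50_000):
        df_small = generate_small_df(i)  # hypothetical: builds one small dataframe
        # yield only the filtered rows so nothing else is kept in memory
        yield df_small[df_small['Meet'].isin([4, 5])]

# consume the generator lazily, appending each filtered chunk to one output file
header = True
for chunk in frames():
    chunk.to_csv("final.txt", sep="\t", index=False, header=header, mode="a")
    header = False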
I tried to skip the step of writing each small dataframe as a small file to local disk.
It filtered the rows (4 or 5 in Meet) instead of saving all rows in each smaller dataframe, and wrote them directly to a final txt file.
The disadvantage was that I was not able to monitor the progress, as it was running "behind the scenes". Since it took even longer than the old method, I aborted the test.
Another try was to save a small dataframe only when there are filtered rows (4 or 5 in Meet), writing those into small Parquet files, and then take a further step to concatenate the Parquet files. This is the most efficient way I have found, even though it still takes a long time.
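A minimal sketch of that concatenation step, assuming the small filtered files are written as part_*.parquet in the working directory (the file-name pattern is my assumption):
import glob
import pandas as pd

# read back every small filtered Parquet file and concatenate into one dataframe
parts = [pd.read_parquet(p) for p in sorted(glob.glob("part_*.parquet"))]
final_df = pd.concat(parts, ignore_index=True)
final_df.to_csv("final.txt", sep="\t", index=False)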
Comment (Parfait, Nov 25, 2024): Without more background you are asking for a Y solution and not asking about your X problem. See "What is the XY problem?". Omitting 1,315 lines doesn't help us help you.