A script generates 50,000 dataframes of the structure below and saves them one by one to the local disk. To improve efficiency, I changed the output format from Excel to Parquet; however, it does not seem to run any faster.
The 50,000 dataframes are then looped over to filter the rows where the Meet column is 4 or 5, and those rows are saved to a final txt file.
What is a better solution here (to end up with a final file containing only the rows with 4 or 5 in Meet)?
I'm thinking that when generating the 50,000 dataframes, the script could filter the rows (4 or 5 in Meet) instead of keeping all rows in each smaller dataframe, and instead of writing 50,000 dataframes it could write directly to the final txt file. That is, skip the step of writing each small dataframe as a small file to the local disk.
The number of rows may reach into the millions. I'm not sure whether a normal laptop can handle that (Win11, 16 GB RAM, no internet connection).
Dict DT Length Meet
0 {'key_0': 45, 'key_1': 67} 2023-10-15 14:32:10 15 5
1 {'key_0': 12, 'key_1': 34} 2023-10-12 09:15:45 19 3
2 {'key_0': 56, 'key_1': 89} 2023-10-20 11:45:30 13 7
3 {'key_0': 23, 'key_1': 45} 2023-10-05 08:20:00 17 4
4 {'key_0': 78, 'key_1': 12} 2023-10-10 16:05:55 10 6
Due to the length of the code (1,315 lines) and privacy, I'm sorry that I cannot paste it here. I have tried writing to one final dataframe directly, but the power was accidentally lost once and everything had to be rerun.
big_df = []
......
- - - lines to generate df_small - - -
df_small = df_small[df_small['Meet'].isin([4, 5])]
big_df.append(df_small)  # append the filtered small dataframe
writing_df = pd.concat(big_df, ignore_index=True)
writing_df.to_excel('final.xlsx', index=False)
4 Answers
Since you have a working process to generate the 50k dataframes, consider adding one more step to the existing looping logic to filter and save data to a growing CSV file. Specifically, turn this filtering and saving process into a defined function that appends to the CSV file. Then, at the end of your main looping logic, run each dataframe through it. This avoids building large objects with list.append, pd.concat, or even assignment using =.
def filter_save_data(df, csv_file):
    # append only the rows where Meet is 4 or 5 to the growing CSV file
    df[df['Meet'].isin([4, 5])].to_csv(
        path_or_buf=csv_file, header=False, index=False, mode="a"
    )
for ... :
    # ... looping logic to generate data ...
    filter_save_data(df, "output.csv")
    # ... other processing with data ...
Be sure to delete any previous CSV file and initialize it with the column headers before the looping logic.
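For example, a minimal sketch of that initialization step, assuming the output file is named output.csv and the columns match the sample data above:
import os
import pandas as pd

csv_file = "output.csv"

# start fresh: remove any output from a previous run, then write the header row once
if os.path.exists(csv_file):
    os.remove(csv_file)
pd.DataFrame(columns=["Dict", "DT", "Length", "Meet"]).to_csv(csv_file, index=False)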
In lieu of appending to a CSV, consider using DataFrame.to_sql to append to a table within a file-level database such as SQLite, whose DB-API driver (sqlite3) is part of the Python standard library:
from sqlalchemy import create_engine
...
def filter_save_data(df):
    # build the engine for the SQLite file (keep the line matching your OS)
    engine = create_engine('sqlite:////home/path/to/mydatabase.db')   # UNIX
    engine = create_engine(r'sqlite:///C:\path\to\mydatabase.db')     # WINDOWS

    # append only the rows where Meet is 4 or 5 to the database table
    df[df['Meet'].isin([4, 5])].to_sql(
        name="mytable",
        con=engine,
        index=False,
        if_exists="append"
    )

for ... :
    # ... looping logic to generate data ...
    filter_save_data(df)
    # ... other processing with data ...
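As a small refinement (my suggestion, not part of the snippet above), the engine can be created once outside the function and passed in, so each call does not rebuild the connection:
# sketch: build the engine once and reuse it for every append
engine = create_engine(r'sqlite:///C:\path\to\mydatabase.db')

def filter_save_data(df, engine):
    df[df['Meet'].isin([4, 5])].to_sql(
        name="mytable", con=engine, index=False, if_exists="append"
    )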
To later read back into pandas:
import sqlite3
import pandas as pd

# pd.read_sql_table requires a SQLAlchemy connectable, so with a plain
# sqlite3 connection use pd.read_sql with a query instead
conn = sqlite3.connect("/path/to/mydatabase.db")
final_df = pd.read_sql("SELECT * FROM mytable", conn)
Conceptually, I would evaluate the rows as you create them and only persist the ones you want. Of course, if you need the 50k files for some other reason, that would invalidate this.
I would also recommend you switch from Excel files to CSV files, or really anything you can stream. I would also remove pandas from this unless there is some compelling reason to keep it, such as grouping.
Conceptually, here is a script that can generate an almost arbitrarily large result containing only Meet values of 4 and 5.
import random
import datetime
import csv

def generate_random_row():
    return {
        "Dict": {
            'key_0': random.randint(1, 99),
            'key_1': random.randint(1, 99)
        },
        "DT": datetime.datetime.strftime(datetime.datetime.fromtimestamp(random.randint(1672531200, 1704067199)), "%Y-%m-%d %H:%M:%S"),
        "Length": random.randint(10, 20),
        "Meet": random.randint(1, 10)
    }

rows_to_generate = 1_000_000

with open("final.csv", "w", newline="") as file_out:
    writer = csv.DictWriter(file_out, fieldnames=generate_random_row().keys())
    writer.writeheader()
    for index in range(rows_to_generate):
        row = generate_random_row()
        if row["Meet"] in [4, 5]:
            writer.writerow(row)
        if not index % 1_000:
            print(f"processed: {index} candidate rows", end="\r", flush=True)
    print(f"processed: {index} candidate rows")
If you must use the 50k-files strategy and can write them as CSV files (which Excel can still read), let me know and I'll make a slight modification to this and use glob.glob() to open and stream-process them.
Here is a version that processes all CSV files matching the pattern "starting_*.csv". Note that glob.glob() can also find nested files of this pattern (with a recursive ** pattern), but all the test files I created happen to be in the folder I am testing from.
import glob

with open("final.csv", "w", newline="") as file_out:
    headers_written = False
    for starting_file in glob.glob("starting_*.csv"):
        print(f"Processing file: {starting_file}", end="\r", flush=True)
        with open(starting_file, "r") as file_in:
            headers = next(file_in)
            if not headers_written:
                file_out.write(headers)
                headers_written = True
            for row in file_in:
                columns = row.split(",")
                # the Meet value is the last column; keep only rows of 4 or 5
                if columns[-1].strip() in ("4", "5"):
                    file_out.write(row)
    print()
Without looking at the code, this is what I recommend.
Put the frames in a generator.
If you use a function to generate the data, use yield instead of return. That means the data can be consumed on the fly without holding everything in memory.
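A minimal sketch of that idea, assuming a hypothetical generate_small_df() that builds one small dataframe per iteration (the Meet filter values come from the question; everything else is an assumption):
import pandas as pd

def frames():
    for i in range(50_000):
        df_small = generate_small_df(i)  # hypothetical: builds one small dataframe
        # yield only the filtered rows so nothing else is kept in memory
        yield df_small[df_small['Meet'].isin([4, 5])]

# consume the generator lazily, appending each filtered chunk to one output file
header = True
for chunk in frames():
    chunk.to_csv("final.txt", sep="\t", index=False, header=header, mode="a")
    header = False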
I tried to skip the step of writing each small dataframe as a small file to local disk.
It filtered the rows (4 or 5 in Meet) instead of saving all rows in each smaller dataframe, and wrote them directly to a final txt file.
The disadvantage was that I was not able to monitor the progress, as it was running "behind the scenes". Since it took even longer than the old method, I aborted the test.
Another try was to save a small dataframe only when there are filtered rows (4 or 5 in Meet), writing those into small Parquet files, and then take a further step to concatenate the Parquet files. This is the most efficient way I have found, even though it still takes a long time.
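A minimal sketch of that concatenation step, assuming the small filtered files are written as part_*.parquet in the working directory (the file-name pattern is my assumption):
import glob
import pandas as pd

# read back every small filtered Parquet file and concatenate into one dataframe
parts = [pd.read_parquet(p) for p in sorted(glob.glob("part_*.parquet"))]
final_df = pd.concat(parts, ignore_index=True)
final_df.to_csv("final.txt", sep="\t", index=False)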
Comment (Parfait, Nov 25, 2024): Without more background you are asking for a Y solution and not asking about your X problem. See "What is the XY problem?". Omitting 1,315 lines doesn't help us help you.