
pandas - In Python, what is the fastest way to append numerical tabular data to a data file? - Stack Overflow


I am trying to write a script that appends new rows of a table to an existing file of tabular data, so that not all progress is lost if an error is encountered. Here are seven different ways to do this in Python:

  • Write the data to a text file (a module-free method).
  • Write the data to a CSV file using writerows of the csv module.
  • Write the data to a database using sqlite3 (or another database library).
  • Write the data to a CSV file using polars.
  • Structure the data as a pandas DataFrame and pickle it.
  • Structure the data as a polars DataFrame and pickle it.
  • Structure the data as a pandas DataFrame and save it as a Parquet file using fastparquet.

The following code measures the runtime of each of these options, excluding the time required to build the data, the header, and any single-row DataFrames. (I have tried to keep the code as concise as possible while still keeping it readable.)

import numpy as np
from time import perf_counter as pc
import csv
import sqlite3
import pandas as pd
import polars as pl
import pickle as pkl
import fastparquet as fp

header = list('ABCDE')
n_rows = 10**3
data = np.random.rand(n_rows, 5)
df_pd = pd.DataFrame(data, columns=header)
df_pl_header = pl.DataFrame(schema=header)
df_pl = pl.DataFrame(data, schema=header)

print('----------------------------------------------------------------------')
print('[no module]')
fname = 'file0.txt'

pc0 = pc()
with open(fname, mode='w') as f:
    f.write('\t'.join(header) + '\n')
    f.flush()
    for k in range(n_rows):
        f.write('\t'.join(map(str, data[k])) + '\n')
        f.flush()
print(f'To write data to file:  {pc()-pc0} sec')

print('----------------------------------------------------------------------')
print('csv')
fname = 'file1.csv'

pc0 = pc()
with open(fname, mode='w', newline='') as f:
    csvwriter = csv.writer(f)
    csvwriter.writerow(header)
    f.flush()
    for row in data:
        csvwriter.writerow(row)
        f.flush()
print(f'To write data to file:  {pc()-pc0} sec')

print('----------------------------------------------------------------------')
print('sqlite3')
fname = 'file2.db'

pc0 = pc()
conn = sqlite3.connect(fname)
cur = conn.cursor()
cur.execute('DROP TABLE IF EXISTS data_table')
cur.execute('CREATE TABLE IF NOT EXISTS data_table (idx integer PRIMARY KEY, A real, B real, C real, D real, E real)')
for k in range(n_rows):
    cur.execute('INSERT INTO data_table VALUES (?, ?, ?, ?, ?, ?)', [k, *data[k]])
conn.commit()
conn.close()
print(f'To write data to file:  {pc()-pc0} sec')

print('----------------------------------------------------------------------')
print('polars')
fname = 'file3.csv'

pc0 = pc()
with open(fname, mode='w', encoding='utf8') as f:
    df_pl_header.write_csv(f)
    for k in range(n_rows):
        df_pl[k].write_csv(f, include_header=False)
print(f'To write data to file:  {pc()-pc0} sec')

print('----------------------------------------------------------------------')
print('pandas, pickle')
fname = 'file4.pkl'

pc0 = pc()
df_new = df_pd.iloc[[0]]
df_new.to_pickle(fname)
for k in range(1, n_rows):
    df_new = pd.concat([df_new, df_pd.iloc[[k]]], ignore_index=True)
    df_new.to_pickle(fname)
print(f'To write data to file:  {pc()-pc0} sec')

print('----------------------------------------------------------------------')
print('polars, pickle')
fname = 'file5.pkl'

pc0 = pc()
df_new = df_pl[0]
df_ser = pkl.dumps(df_new)
with open(fname, mode='wb') as f:
    f.write(df_ser)
for k in range(1, n_rows):
    df_new = df_new.vstack(df_pl[k])
    df_ser = pkl.dumps(df_new)
    with open(fname, mode='wb') as f:
        f.write(df_ser)
print(f'To write data to file:  {pc()-pc0} sec')

print('----------------------------------------------------------------------')
print('pandas, fastparquet')
fname = 'file6.parquet'

pc0 = pc()
fp.write(fname, df_pd.iloc[[0]])
for k in range(1, n_rows):
    fp.write(fname, df_pd.iloc[[k]], append=True)
print(f'To write data to file:  {pc()-pc0} sec')
print('----------------------------------------------------------------------')

On my machine, running this script yields the following times.

----------------------------------------------------------------------
[no module]
To write data to file:  0.01229649999004323 sec
----------------------------------------------------------------------
csv
To write data to file:  0.011919600001419894 sec
----------------------------------------------------------------------
sqlite3
To write data to file:  0.02571699999680277 sec
----------------------------------------------------------------------
polars
To write data to file:  0.12898200000927318 sec
----------------------------------------------------------------------
pandas, pickle
To write data to file:  0.6281701999978395 sec
----------------------------------------------------------------------
polars, pickle
To write data to file:  1.4392062999977497 sec
----------------------------------------------------------------------
pandas, fastparquet
To write data to file:  33.678610299990396 sec
----------------------------------------------------------------------

So it appears that the first three approaches are much faster than the other four for this specific task and test. When I increase the size of the table (both in n_rows and in the number of columns), sqlite3 generally pulls ahead as the fastest option.
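For reference, one variant worth benchmarking alongside the per-row sqlite3 loop above is batching inserts with `executemany` and committing once per batch instead of once at the end (or once per row). This is a sketch, not a definitive answer: it trades per-row durability for speed, so it only fits if losing the last partial batch on a crash is acceptable. The schema matches the benchmark above; the in-memory database path is illustrative.

```python
import sqlite3
import numpy as np

data = np.random.rand(1000, 5)

conn = sqlite3.connect(':memory:')  # use a file path for persistence
cur = conn.cursor()
cur.execute('CREATE TABLE data_table (idx INTEGER PRIMARY KEY, '
            'A REAL, B REAL, C REAL, D REAL, E REAL)')

batch_size = 100
for start in range(0, len(data), batch_size):
    # Build one batch of parameter tuples; float() avoids relying on
    # sqlite3 adapting numpy scalars directly.
    batch = [(start + i, *map(float, row))
             for i, row in enumerate(data[start:start + batch_size])]
    cur.executemany('INSERT INTO data_table VALUES (?, ?, ?, ?, ?, ?)', batch)
    conn.commit()  # one transaction per batch instead of per row

count = cur.execute('SELECT COUNT(*) FROM data_table').fetchone()[0]
conn.close()
```

Each `commit()` marks a durability checkpoint, so at most one batch of progress is lost on failure, while the per-statement overhead is amortized across the batch.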

But am I missing faster implementations of the four slower alternatives, or other fast ways of appending numerical data to a tabular file besides those listed?

(I am posting this question in part because I and others I know have run into this problem several times, and we have not found many published speed comparisons of the numerous available methods, which are mostly discussed separately online.)
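For completeness, the crash-resilient pattern the question is after can also be written with a true append-mode file, so that an interrupted run keeps everything written so far and a later run can resume. This is a minimal sketch rather than a benchmarked option; the file name and helper function are illustrative, not from the code above.

```python
import csv
import os

fname = 'resumable.csv'   # illustrative file name
header = list('ABCDE')

def append_rows(rows):
    """Append rows to the CSV, writing the header only on first creation."""
    new_file = not os.path.exists(fname)
    with open(fname, mode='a', newline='') as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(header)
        writer.writerows(rows)

# Two separate calls simulate two chunks (or two interrupted runs).
append_rows([[1.0, 2.0, 3.0, 4.0, 5.0]])
append_rows([[6.0, 7.0, 8.0, 9.0, 10.0]])
```

Because the file is reopened with `mode='a'` each time, each call is independent, and the operating system flushes the buffer on `close()`, so no explicit per-row `flush()` is needed.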
