Assume a Dask dataframe partitioned on the region column. As more data arrives, more parquet files are created within each region partition. I want to "concatenate" all the parquet files in any region that has more than one of them into a single parquet file. Assume further that the data is large, so I do not want to touch regions that already hold only one parquet file. I tried various ways but did not manage to make it work (a rough sketch of what I am after is at the end of the post).
P.S. I hope my usage of the word "partition" is not confusing - I am not sure how else to differentiate between regions and individual parquet files.
Old data:
my_df
├── region=A
│ └── part.0.parquet
├── region=B
│ ├── part.0.parquet
│ └── part.3.parquet
├── region=C
│ ├── part.0.parquet
│ └── part.3.parquet
└── region=D
└── part.3.parquet
Desired data:
my_df
├── region=A
│ └── part.0.parquet
├── region=B
│ └── part.4.parquet
├── region=C
│ └── part.4.parquet
└── region=D
└── part.3.parquet
Code to create data:
import dask.dataframe as dd
import pandas as pd
from pyarrow import fs
# create local filesystem and path to data
filesys = fs.LocalFileSystem()
PATH = "my_df"
# create dataframes
df1 = pd.DataFrame({
    "region": ["A", "B", "C"],
    "date": ["2025-03-21", "2025-03-22", "2025-03-23"],
    "value": [100, 200, 300],
})
df2 = pd.DataFrame({
    "region": ["B", "C", "D"],
    "date": ["2025-03-22", "2025-03-23", "2025-03-24"],
    "value": [150, 250, 350],
})
# create dask dataframes and write to parquet
ddf1 = dd.from_pandas(df1)
ddf1.to_parquet(PATH, filesystem=filesys, partition_on=["region"], append=True)
ddf2 = dd.from_pandas(df2)
ddf2.to_parquet(PATH, filesystem=filesys, partition_on=["region"], append=True, ignore_divisions=True)
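For illustration, this is roughly the consolidation I am after, done per region with plain pyarrow rather than through Dask (the name part.consolidated.parquet is only a placeholder - ideally the new file would get the next part number, like part.4.parquet above):

import glob
import os
import pyarrow as pa
import pyarrow.parquet as pq

for region_dir in sorted(glob.glob(os.path.join(PATH, "region=*"))):
    files = sorted(glob.glob(os.path.join(region_dir, "*.parquet")))
    if len(files) <= 1:
        continue  # region already holds a single parquet file - leave it alone
    # read every fragment of this region and combine into one table
    table = pa.concat_tables([pq.read_table(f) for f in files])
    # write the combined data to a new file, then drop the old fragments
    pq.write_table(table, os.path.join(region_dir, "part.consolidated.parquet"))
    for f in files:
        os.remove(f)

This only ever loads one region at a time, which is fine memory-wise, but what I would really like is a Dask-native way to do the same thing so I do not have to walk the directories myself.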