I'm using PyIceberg with the AWS Glue REST catalog to insert data into an Iceberg table stored in S3. The insertion itself works fine, but I noticed that PyIceberg adds unwanted nested directories to the S3 path ahead of the actual partition directories.
Here’s an example of what’s being created:
s3://my-bucket/data/1011/0000/0111/01111110/tenent_id=ten1/account_id=acc7/marketplace_id=UK/time_window_start_year=2021/00000-30-fec66163-f813-4002-acd3-14a59735647b.parquet
Instead of this deep nested structure, I was expecting a cleaner partitioning structure like:
s3://my-bucket/data/tenent_id=ten1/account_id=acc7/marketplace_id=UK/time_window_start_year=2021/00000-30.parquet
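To make the difference concrete, here is a small sketch in plain Python (no PyIceberg involved) that splits the observed key into the unexpected prefix and the partition directories. The helper name `split_data_key` and its `data_root` parameter are only for illustration, and it assumes partition directories always have the form `column=value`.

```python
from urllib.parse import urlparse


def split_data_key(uri: str, data_root: str = "data"):
    """Split a data file URI into (extra_dirs, partition_dirs, file_name).

    Hypothetical helper for illustration; assumes Hive-style
    "column=value" partition directories under `data_root`.
    """
    parts = urlparse(uri).path.strip("/").split("/")
    # Keep only the directories after the data root; the last part is the file.
    start = parts.index(data_root) + 1
    dirs, file_name = parts[start:-1], parts[-1]
    partition_dirs = [d for d in dirs if "=" in d]
    extra_dirs = [d for d in dirs if "=" not in d]
    return extra_dirs, partition_dirs, file_name


observed = (
    "s3://my-bucket/data/1011/0000/0111/01111110/"
    "tenent_id=ten1/account_id=acc7/marketplace_id=UK/"
    "time_window_start_year=2021/"
    "00000-30-fec66163-f813-4002-acd3-14a59735647b.parquet"
)

extra, partitions, name = split_data_key(observed)
print("extra directories:    ", extra)
print("partition directories:", partitions)
```

The partition directories match what I expect; only the four leading segments are surprising. They contain nothing but 0s and 1s (4 + 4 + 4 + 8 characters), so they do not correspond to any of my partition columns.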
Setup Details:
- Catalog Type: AWS Glue REST
- PyIceberg Version: 0.9.0
- Storage: S3
- Partitioning:
    partition_spec=[
        ("tenent_id", "identity"),
        ("account_id", "identity"),
        ("marketplace_id", "identity"),
        ("time_window_start", "year"),
    ]
- Code Snippet for Insertion:
    table = catalog.load_table("ams_namespace.ams_poc_table")
    table.append(pa.Table.from_pylist(fake_data, schema=schema))
Questions:
- Why is PyIceberg creating these extra nested directories (e.g., 1011/0000/0111/01111110/)?
- Is there a way to disable or control this behavior?