python - PyIceberg with AWS Glue Creates Unwanted Nested Directories in S3 Tables - Stack Overflow

I'm using PyIceberg with the AWS Glue REST catalog to insert data into an Iceberg table stored in S3. The insertion itself works fine, but I noticed that PyIceberg creates unwanted nested directories in S3 above the actual partition directories.

Here’s an example of what’s being created:

s3://my-bucket/data/1011/0000/0111/01111110/tenent_id=ten1/account_id=acc7/marketplace_id=UK/time_window_start_year=2021/00000-30-fec66163-f813-4002-acd3-14a59735647b.parquet

Instead of this deeply nested structure, I was expecting a cleaner partition layout like:

s3://my-bucket/data/tenent_id=ten1/account_id=acc7/marketplace_id=UK/time_window_start_year=2021/00000-30.parquet
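
To make the difference between the two layouts concrete, here is a minimal sketch that splits each path into the directories before the first partition directory, the partition directories themselves, and the file name. It uses only the two paths shown above; the helper name `split_data_path` is hypothetical, and it assumes partition directories can be recognised purely by their `name=value` shape.

```python
from urllib.parse import urlparse


def split_data_path(uri: str) -> tuple[list[str], list[str], str]:
    """Split an object URI into (prefix dirs, partition dirs, file name).

    Assumption: a partition directory is any directory containing "=".
    Everything between the bucket and the first such directory is the prefix.
    """
    *dirs, file_name = urlparse(uri).path.strip("/").split("/")
    first_partition = next(
        (i for i, d in enumerate(dirs) if "=" in d), len(dirs)
    )
    return dirs[:first_partition], dirs[first_partition:], file_name


actual = (
    "s3://my-bucket/data/1011/0000/0111/01111110/"
    "tenent_id=ten1/account_id=acc7/marketplace_id=UK/"
    "time_window_start_year=2021/"
    "00000-30-fec66163-f813-4002-acd3-14a59735647b.parquet"
)
expected = (
    "s3://my-bucket/data/tenent_id=ten1/account_id=acc7/marketplace_id=UK/"
    "time_window_start_year=2021/00000-30.parquet"
)

actual_prefix, actual_partitions, _ = split_data_path(actual)
expected_prefix, expected_partitions, _ = split_data_path(expected)

# The unwanted directories are whatever the actual prefix has beyond the expected one.
extra_dirs = actual_prefix[len(expected_prefix):]
print(extra_dirs)  # ['1011', '0000', '0111', '01111110']
```

The partition directories are identical in both paths; only the four binary-looking directories between `data/` and `tenent_id=ten1/` differ.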

Setup Details:

  • Catalog Type: AWS Glue REST
  • PyIceberg Version: 0.9.0
  • Storage: S3
  • Partitioning:
    partition_spec=[
        ("tenent_id", "identity"),
        ("account_id", "identity"),
        ("marketplace_id", "identity"),
        ("time_window_start", "year"),
    ]
    
  • Code Snippet for Insertion:
    import pyarrow as pa

    # catalog, schema and fake_data are created earlier (not shown)
    table = catalog.load_table("ams_namespace.ams_poc_table")
    table.append(pa.Table.from_pylist(fake_data, schema=schema))
    

Questions:

  1. Why is PyIceberg creating these extra nested directories (e.g., 1011/0000/0111/01111110/)?
  2. Is there a way to disable or control this behavior?