
python - Data is shuffled when writing an Iceberg table via Glue PySpark for data ingested by DMS - Stack Overflow


I am trying to read CSV data ingested by the AWS Database Migration Service (DMS) like below:

spark = SparkSession.builder \
    .config(f"spark.sql.catalog.{catalog_nm}", "org.apache.iceberg.spark.SparkCatalog") \
    .config(f"spark.sql.catalog.{catalog_nm}.warehouse", warehouse_path) \
    .config(f"spark.sql.catalog.{catalog_nm}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config(f"spark.sql.catalog.{catalog_nm}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.iceberg.handle-timestamp-without-timezone", "true") \
    .getOrCreate()

raw_df = spark.read.option("header","true").csv(f"s3://.../public/{table_name}/")

Now, I am trying to create an Iceberg table using PySpark as below:

raw_df.select("id","city","state").write.format("iceberg") \
    .mode("overwrite") \
    .saveAsTable(f"{catalog_nm}.{database_op}.{table_op}_11")

But when the data is viewed in Athena, the column values are shifted (screenshot omitted):

The ID column contains the insert/update flags of the "Op" column generated by the DMS service. I also tried creating a schema and appending the data, but I still had the same issue. I tried dropping the "op" column as well, but when I printed the schema of the DataFrame, the op column wasn't present.
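For context on how this kind of shift can happen: DMS CDC output files prepend an "Op" flag column (I/U/D) that full-load files do not have, so parsing a CDC row against the source table's column list misaligns every field by one. The sketch below (plain Python, no Spark; the row values are made up for illustration) shows the mechanism:

```python
# Columns expected from the source table (no Op column).
header = ["id", "city", "state"]

# A hypothetical DMS CDC row: the change flag "I" comes first.
cdc_row = ["I", "42", "Austin", "TX"]

# Zipping the expected header over a CDC row shifts every column:
# "id" picks up the Op flag, "city" picks up the id, and so on.
misaligned = dict(zip(header, cdc_row))

# Declaring the Op column explicitly restores alignment.
aligned = dict(zip(["op"] + header, cdc_row))
print(misaligned["id"], aligned["id"])  # I 42
```

If this is the cause, reading the CSVs with an explicit schema whose first field is the Op column (and `header` set to match what DMS actually writes) should keep the columns aligned before selecting and writing to Iceberg.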

So, what am I missing here?
