I have one use case.
Sink S1 -> a Spark job (AWS Glue) I have written that sinks data from OpenSearch to S3.
Sink S2 -> another job (Flink) that sinks data from Kafka to S3, into the same folder as S1.
Both jobs work fine on their own. When I sink from S1, I can see it writes all the data correctly, but when I then sink from S2, it cannot write into the same folder. Writing to a different folder works, even though the schema and data format are the same.
I have checked the hoodie.properties file for both; a lot of properties are saved there by default, and those also differ between the two (a sketch of pulling both files for comparison is below).
The first image shows the Spark content and the second the Flink content of the S3 folder.
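Both files live at <table_path>/.hoodie/hoodie.properties; this is only a sketch of fetching them side by side, and the bucket name and table prefixes are placeholders rather than my real paths.

# Sketch: fetch both hoodie.properties files to compare them; bucket and prefixes are placeholders.
import boto3

s3 = boto3.client("s3")
for prefix in ("spark_table", "flink_table"):
    obj = s3.get_object(Bucket="my-bucket", Key=f"{prefix}/.hoodie/hoodie.properties")
    print(f"--- {prefix} ---")
    print(obj["Body"].read().decode("utf-8"))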
Both tables are MERGE_ON_READ. For S1, here is the Hudi config:
hudi_options = {
    "hoodie.table.name": args["HUDI_TABLE_NAME"],
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.schema.on.read.enable": "true",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.write.precombine.field": "time",
    "hoodie.datasource.write.partitionpath.field": "customer_uuid,year,month",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.database.name": "default_database",
}
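For context, the options are passed to the Spark writer roughly like this; it is only a sketch, and df, S3_PATH, the write operation, and the save mode are placeholders/assumptions rather than the exact calls from my job.

# Minimal sketch of how hudi_options is applied in the Glue/Spark job.
# df is the DataFrame read from OpenSearch and S3_PATH is the target folder (placeholders).
(df.write.format("hudi")
    .options(**hudi_options)
    .option("hoodie.datasource.write.operation", "upsert")  # assumed; not shown in the config above
    .mode("append")                                         # assumed; not shown in the config above
    .save(S3_PATH))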
For the S2 (Flink) sink, here is the Hudi config:
(" +
" 'connector' = 'hudi'," +
" 'path' = '" + S3_PATH + "'," +
" 'table.type' = 'MERGE_ON_READ'," +
" 'hoodie.table.name' = '" + HUDI_TABLE_NAME + "'," +
" 'hoodie.datasource.write.recordkey.field' = 'event_id'," +
" 'hoodie.datasource.write.partitionpath.field' = 'customer_uuid,year,month'," +
" 'hoodie.datasource.write.precombine.field' = 'time'," +
" 'hoodie.datasource.write.hive_style_partitioning' = 'true'," +
" 'hoodie.write.concurrency.mode' = 'optimistic_concurrency_control'," +
" 'hoodie.schema.on.read.enable' = 'true'," +
" 'hoodie.compaction.payload.class' = 'org.apache.hudi.common.model.DefaultHoodieRecordPayload'," +
" 'hoodie.archivelog.folder' = 'history'," +
" 'hoodie.timeline.path' = 'timeline'," +
" 'hoodie.table.base.file.format' = 'PARQUET'," +
" 'hoodie.table.metadata.partitions' = 'files'," +
" 'hoodie.table.keygenerator.type' = 'COMPLEX'" +
")";
What should I do so that both jobs can write to the same S3 folder with no data loss?
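For what it is worth, both configs set hoodie.write.concurrency.mode to optimistic_concurrency_control, but neither config shown above includes a lock provider, which (as far as I understand) Hudi's multi-writer OCC relies on. Below is only a sketch of what the extra Spark-side options might look like with the DynamoDB-based lock provider; the lock table name, partition key, and region are placeholders, not values from my setup.

# Hypothetical additions to hudi_options for multi-writer OCC (not part of my current job);
# the DynamoDB lock table, partition key, and region are placeholders.
hudi_options.update({
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider": "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
    "hoodie.write.lock.dynamodb.table": "hudi_lock_table",
    "hoodie.write.lock.dynamodb.partition_key": "hudi_lock_key",
    "hoodie.write.lock.dynamodb.region": "us-east-1",
})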
I have also noticed that if I sink via Flink first, it works properly, but if I then run the Glue job against the same folder, it deletes all the existing content and overwrites the data written by Flink with its own. Vice versa (Glue first, then Flink), it gives an error.
Thanks.