I have one use case.
Sink S1 -> a Spark job (AWS Glue) I have written that sinks data from OpenSearch to S3.
Sink S2 -> another job (Flink) that sinks data from Kafka to S3, into the same folder as S1.
Both jobs work fine on their own. When I sink from S1, I can see it writes all the data correctly, but when I then sink from S2, it cannot write into the same folder. Writing to a different folder works, even though the schema and data format are the same.
I have checked the hoodie.properties file for both; a lot of properties are saved there by default, and those also differ between the two (a sketch of pulling both files for comparison is below).
The first image shows the Spark content and the second the Flink content of the S3 folder.
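Both files live at <table_path>/.hoodie/hoodie.properties; this is only a sketch of fetching them side by side, and the bucket name and table prefixes are placeholders rather than my real paths.

# Sketch: fetch both hoodie.properties files to compare them; bucket and prefixes are placeholders.
import boto3

s3 = boto3.client("s3")
for prefix in ("spark_table", "flink_table"):
    obj = s3.get_object(Bucket="my-bucket", Key=f"{prefix}/.hoodie/hoodie.properties")
    print(f"--- {prefix} ---")
    print(obj["Body"].read().decode("utf-8"))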
Both tables are MERGE_ON_READ. For S1, here is the Hudi config:
hudi_options = {
    "hoodie.table.name": args["HUDI_TABLE_NAME"],
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.schema.on.read.enable": "true",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.write.precombine.field": "time",
    "hoodie.datasource.write.partitionpath.field": "customer_uuid,year,month",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.database.name": "default_database",
}
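For context, the options are passed to the Spark writer roughly like this; it is only a sketch, and df, S3_PATH, the write operation, and the save mode are placeholders/assumptions rather than the exact calls from my job.

# Minimal sketch of how hudi_options is applied in the Glue/Spark job.
# df is the DataFrame read from OpenSearch and S3_PATH is the target folder (placeholders).
(df.write.format("hudi")
    .options(**hudi_options)
    .option("hoodie.datasource.write.operation", "upsert")  # assumed; not shown in the config above
    .mode("append")                                         # assumed; not shown in the config above
    .save(S3_PATH))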
For the S2 (Flink) sink, here is the Hudi config:
(" +
" 'connector' = 'hudi'," +
" 'path' = '" + S3_PATH + "'," +
" 'table.type' = 'MERGE_ON_READ'," +
" 'hoodie.table.name' = '" + HUDI_TABLE_NAME + "'," +
" 'hoodie.datasource.write.recordkey.field' = 'event_id'," +
" 'hoodie.datasource.write.partitionpath.field' = 'customer_uuid,year,month'," +
" 'hoodie.datasource.write.precombine.field' = 'time'," +
" 'hoodie.datasource.write.hive_style_partitioning' = 'true'," +
" 'hoodie.write.concurrency.mode' = 'optimistic_concurrency_control'," +
" 'hoodie.schema.on.read.enable' = 'true'," +
" 'hoodie.compaction.payload.class' = 'org.apache.hudi.common.model.DefaultHoodieRecordPayload'," +
" 'hoodie.archivelog.folder' = 'history'," +
" 'hoodie.timeline.path' = 'timeline'," +
" 'hoodie.table.base.file.format' = 'PARQUET'," +
" 'hoodie.table.metadata.partitions' = 'files'," +
" 'hoodie.table.keygenerator.type' = 'COMPLEX'" +
")";
What should I do so that both jobs can write to the same S3 folder with no data loss?
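For what it is worth, both configs set hoodie.write.concurrency.mode to optimistic_concurrency_control, but neither config shown above includes a lock provider, which (as far as I understand) Hudi's multi-writer OCC relies on. Below is only a sketch of what the extra Spark-side options might look like with the DynamoDB-based lock provider; the lock table name, partition key, and region are placeholders, not values from my setup.

# Hypothetical additions to hudi_options for multi-writer OCC (not part of my current job);
# the DynamoDB lock table, partition key, and region are placeholders.
hudi_options.update({
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider": "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
    "hoodie.write.lock.dynamodb.table": "hudi_lock_table",
    "hoodie.write.lock.dynamodb.partition_key": "hudi_lock_key",
    "hoodie.write.lock.dynamodb.region": "us-east-1",
})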
I have also noticed that if I sink via Flink first, it works properly, but if I then run the Glue job against the same folder, it deletes all the existing content and overwrites the data written by Flink with its own. Vice versa (Glue first, then Flink), it gives an error.
Thanks.