I'm trying to use an AWS Glue Streaming ETL job to read and write with trigger.AvailableNow and Kinesis Data Streams, the same way I did with Kafka, but no records are processed and the checkpoint files contain inconsistent data for startingPosition and iteratorType.
streaming_df: DataFrame = spark \
    .readStream \
    .format("kinesis") \
    .option("streamName", "oxg-cdp-cdc-stream-sbx") \
    .option("endpointUrl", "https://kinesis.eu-central-1.amazonaws.com") \
    .option("region", "eu-central-1") \
    .option("startingPosition", "earliest") \
    .load()

streaming_df \
    .writeStream \
    .option("checkpointLocation", "<s3_location>") \
    .trigger(availableNow=True) \
    .foreachBatch(for_each_batch_funtion) \
    .start() \
    .awaitTermination()
It seems that availableNow doesn't work properly with Kinesis, but I can't find any official documentation stating that.
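For now, the fallback I'm considering is to emulate availableNow by running the same pipeline with a processing-time trigger and stopping the query once a micro-batch reads no new records. This is only a sketch: the drain check via lastProgress is my own heuristic, not something the Kinesis connector provides.

import time

# Same source as above, but driven by a short processing-time trigger
# instead of availableNow.
query = streaming_df \
    .writeStream \
    .option("checkpointLocation", "<s3_location>") \
    .trigger(processingTime="30 seconds") \
    .foreachBatch(for_each_batch_funtion) \
    .start()

while query.isActive:
    progress = query.lastProgress  # metrics of the most recent micro-batch, or None
    if progress is not None and progress.get("numInputRows", 0) == 0:
        query.stop()  # no new records in the last batch: treat the stream as drained
    time.sleep(60)

query.awaitTermination()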
1 Answer
Here is a Spark GitHub example: Spark-Kinesis
I also found the official documentation from Spark: Spark Streaming + Kinesis
Hope this can help you.
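If the Structured Streaming Kinesis source really doesn't honour trigger.availableNow in Glue, another option is the Glue-native streaming pattern from the AWS Glue docs. Below is a rough sketch reusing your foreachBatch function; the stream ARN, account id, window size and checkpoint path are placeholders, and the exact connection options can differ between Glue versions. Note that forEachBatch keeps the job running on a fixed window rather than stopping once the stream is drained.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the Kinesis stream through the Glue connector instead of spark.readStream.
kinesis_frame = glue_context.create_data_frame.from_options(
    connection_type="kinesis",
    connection_options={
        "streamARN": "arn:aws:kinesis:eu-central-1:<account_id>:stream/oxg-cdp-cdc-stream-sbx",
        "startingPosition": "earliest",
        "classification": "json",
        "inferSchema": "true",
    },
    transformation_ctx="kinesis_frame",
)

# Glue drives the micro-batches itself on the given window; checkpointing
# works the same way as with writeStream.
glue_context.forEachBatch(
    frame=kinesis_frame,
    batch_function=for_each_batch_funtion,
    options={
        "windowSize": "100 seconds",
        "checkpointLocation": "<s3_location>",
    },
)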