I'm trying to use an AWS Glue Streaming ETL job to read and write with trigger.AvailableNow and Kinesis Data Streams, the same way I did with Kafka, but no records are processed and the checkpoint files contain inconsistent data for startingPosition and iteratorType.
streaming_df: DataFrame = spark \
    .readStream \
    .format("kinesis") \
    .option("streamName", "oxg-cdp-cdc-stream-sbx") \
    .option("endpointUrl", "https://kinesis.eu-central-1.amazonaws.com") \
    .option("region", "eu-central-1") \
    .option("startingPosition", "earliest") \
    .load()

streaming_df \
    .writeStream \
    .option("checkpointLocation", "<s3_location>") \
    .trigger(availableNow=True) \
    .foreachBatch(for_each_batch_funtion) \
    .start() \
    .awaitTermination()
It seems that availableNow doesn't work properly with Kinesis, but I can't find any official documentation stating that.
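For now, the fallback I'm considering is to emulate availableNow by running the same pipeline with a processing-time trigger and stopping the query once a micro-batch reads no new records. This is only a sketch: the drain check via lastProgress is my own heuristic, not something the Kinesis connector provides.

import time

# Same source as above, but driven by a short processing-time trigger
# instead of availableNow.
query = streaming_df \
    .writeStream \
    .option("checkpointLocation", "<s3_location>") \
    .trigger(processingTime="30 seconds") \
    .foreachBatch(for_each_batch_funtion) \
    .start()

while query.isActive:
    progress = query.lastProgress  # metrics of the most recent micro-batch, or None
    if progress is not None and progress.get("numInputRows", 0) == 0:
        query.stop()  # no new records in the last batch: treat the stream as drained
    time.sleep(60)

query.awaitTermination()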
1 Answer
Here is a Spark GitHub example: Spark-Kinesis
I also found the official documentation from Spark: Spark Streaming + Kinesis
Hope this can help you.
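If the Structured Streaming Kinesis source really doesn't honour trigger.availableNow in Glue, another option is the Glue-native streaming pattern from the AWS Glue docs. Below is a rough sketch reusing your foreachBatch function; the stream ARN, account id, window size and checkpoint path are placeholders, and the exact connection options can differ between Glue versions. Note that forEachBatch keeps the job running on a fixed window rather than stopping once the stream is drained.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the Kinesis stream through the Glue connector instead of spark.readStream.
kinesis_frame = glue_context.create_data_frame.from_options(
    connection_type="kinesis",
    connection_options={
        "streamARN": "arn:aws:kinesis:eu-central-1:<account_id>:stream/oxg-cdp-cdc-stream-sbx",
        "startingPosition": "earliest",
        "classification": "json",
        "inferSchema": "true",
    },
    transformation_ctx="kinesis_frame",
)

# Glue drives the micro-batches itself on the given window; checkpointing
# works the same way as with writeStream.
glue_context.forEachBatch(
    frame=kinesis_frame,
    batch_function=for_each_batch_funtion,
    options={
        "windowSize": "100 seconds",
        "checkpointLocation": "<s3_location>",
    },
)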