I have a streaming job in PySpark that reads a Delta CDC feed. So far, nothing special. When the source table gets a schema change, the feed changes its schema too, and my streaming job stops and fails with:
com.databricks.sql.transaction.tahoe.DeltaStreamingColumnMappingSchemaIncompatibleException: Streaming read is not supported on tables with read-incompatible schema changes (e.g. rename or drop or datatype changes).
Please provide a 'schemaTrackingLocation' to enable non-additive schema evolution for Delta stream processing.
Normally, I would fix this by adding the schemaTrackingLocation parameter so that Spark can also track non-additive schema changes (i.e., column renames/type changes).
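For context, this is roughly how I would attach it on a plain Delta-to-Delta pipeline; it goes on the CDC reader (the checkpoint path and table name below are just placeholders):

```python
# Sketch: schemaTrackingLocation attached to the Delta CDC reader.
# The path and table name are placeholders, not my real ones.
cdc_stream = (
    spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    # lets the stream survive non-additive changes (rename/drop/type change)
    .option("schemaTrackingLocation", "/mnt/checkpoints/schema_tracking")
    .table("my_source_table")
)
```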
The problem is that the sink is OpenSearch.
Yes, I know: OpenSearch does not need this parameter. However, I have no idea how to accept the incoming DataFrame when a schema change happens.
At the moment, with or without schemaTrackingLocation, the Spark streaming job stops working.
BTW, the data pushed to OpenSearch is filtered, so the document mapping is always honored.
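By "filtered" I mean the job only ever selects a fixed set of columns before writing, roughly like this (column names are placeholders):

```python
# Sketch of the filtering step: only a fixed column list reaches OpenSearch,
# so the index mapping never sees new or renamed source columns.
filtered = cdc_stream.select("id", "status", "updated_at")
```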
Any help is appreciated.
stream_os = (
    query_push_os
    .options(**w_options)                 # mergeSchema and schemaTrackingLocation are set here
    .outputMode('append')
    .format("org.opensearch.spark.sql")
    .options(**os_config)                 # only OpenSearch settings
    .start()
)