I have a streaming job in PySpark that reads a Delta CDC feed. So far, nothing special. When the source table gets a schema change, the feed changes its schema too, and my streaming job stops and fails with:
com.databricks.sql.transaction.tahoe.DeltaStreamingColumnMappingSchemaIncompatibleException: Streaming read is not supported on tables with read-incompatible schema changes (e.g. rename or drop or datatype changes).
Please provide a 'schemaTrackingLocation' to enable non-additive schema evolution for Delta stream processing.
Normally, I would fix this by adding the schemaTrackingLocation parameter so that Spark can also track non-additive schema changes (i.e., column renames/type changes).
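For context, this is roughly how I would attach it on a plain Delta-to-Delta pipeline; it goes on the CDC reader (the checkpoint path and table name below are just placeholders):

```python
# Sketch: schemaTrackingLocation attached to the Delta CDC reader.
# The path and table name are placeholders, not my real ones.
cdc_stream = (
    spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    # lets the stream survive non-additive changes (rename/drop/type change)
    .option("schemaTrackingLocation", "/mnt/checkpoints/schema_tracking")
    .table("my_source_table")
)
```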
The problem is that the sink is OpenSearch.
Yes, I know: OpenSearch does not need this parameter. However, I have no idea how to accept the incoming DataFrame when a schema change happens.
At the moment, with or without schemaTrackingLocation, the Spark streaming job stops working.
BTW, the data pushed to OpenSearch is filtered, so the document mapping is always honored.
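By "filtered" I mean the job only ever selects a fixed set of columns before writing, roughly like this (column names are placeholders):

```python
# Sketch of the filtering step: only a fixed column list reaches OpenSearch,
# so the index mapping never sees new or renamed source columns.
filtered = cdc_stream.select("id", "status", "updated_at")
```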
Any help is appreciated.
stream_os = (
    query_push_os
    .options(**w_options)                 # mergeSchema and schemaTrackingLocation are set here
    .outputMode('append')
    .format("org.opensearch.spark.sql")
    .options(**os_config)                 # only OpenSearch settings
    .start()
)