I'm using deltalake version 0.17.0. Here are the steps we perform:
- Read the DeltaTable from the existing S3 location: dt = DeltaTable("s3://mylocation/")
- Convert it to a pyarrow table: arrow_table = dt.to_pyarrow_table()
- Filter the arrow table and select specific columns of interest
- Convert the arrow table to a pandas DataFrame: df = arrow_table.to_pandas()
- Write the pandas DataFrame back to an existing, new Delta table. The table is empty at this point and has a schema defined with non-nullable fields.
- write_deltalake("s3://test_sample_process/", df, mode="overwrite"); also tried it with schema_mode="overwrite". (A consolidated sketch of these steps follows below.)
The error we get is:
raise ValueError(
ValueError: Schema of data does not match table schema
Data schema:
namespace: string
ki_record_name: string
wk_center: string
kt_config: string
kt_parameters: string
mi_updated_at: timestamp[us, tz=UTC]
mi_updated_by: string
Table Schema:
namespace: string
ki_record_name: string
wk_center: string not null
kt_config: string
kt_parameters: string
mi_updated_at: timestamp[us, tz=UTC] not null
-- field metadata --
comment: '"The time this record was updated"'
mi_updated_by: string not null
-- field metadata --
comment: '"The process that updated this record"'
We verified that the data frame we are trying to write does NOT contain any null values. It has only 2 rows, so a visual inspection was easy. We also posted the same question on the Delta table GitHub repository but did not receive any helpful suggestions. The Delta table uses the pyarrow engine by default in the current version, and the recommendation was to migrate off it. We could try that, but this should work in the current version, which still supports the pyarrow engine.

The same code works when we drop the schema; in that case, the Delta table creates a schema with all nullable fields. I want to enforce/use non-nullable fields, and I am not able to understand why this is failing.
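To make the intent concrete, this is roughly how I would expect to carry the target table's own schema (non-nullable flags and field metadata) into the write, by rebuilding the Arrow table against that exact schema instead of letting pandas infer one. This is a hedged sketch of what we could try, not something we have confirmed resolves the error:

```python
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

# Schema of the existing (empty) target table, including its
# non-nullable fields and field metadata
target = DeltaTable("s3://test_sample_process/")
target_schema = target.schema().to_pyarrow()

# df is the pandas DataFrame produced in the steps above.
# Rebuild the Arrow table from it against the table's exact schema;
# pyarrow raises here if a non-nullable column actually contains nulls.
data = pa.Table.from_pandas(df, schema=target_schema, preserve_index=False)

write_deltalake("s3://test_sample_process/", data, mode="overwrite")
```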