I am writing Parquet files using two different frameworks—Apache Spark (Scala) and Polars (Python)—with the same schema and data. However, when I query the resulting Parquet files using Apache DataFusion, I notice a significant performance difference:
- Queries run faster on the Parquet file written by Polars.
- Queries take longer on the Parquet file written by Spark.

I expected similar performance since the schema and data are the same, and I am trying to understand why this discrepancy occurs.
Here are some details about my setup:
- Spark version: 3.5.0
- Polars version: 1.24.0
Parquet write options:
- Spark: `df.write.parquet("path")`
- Polars: `df.write_parquet("path")`
I also tried changing the compression codec on the Spark side, but the resulting file still did not query as fast as the one written by Polars.
Has anyone experienced a similar issue? What aspects of Spark's and Polars' Parquet writing might cause this performance difference? Are there specific configurations I should check when writing Parquet in either framework?
These are some of the Spark configs I also tried adjusting before writing:
```scala
.config("spark.sql.parquet.compression.codec", "zstd")
.config("parquet.enable.dictionary", "true")
.config("parquet.dictionary.pageSize", 1048576)
.config("parquet.block.size", 4 * 1024 * 1024) // Smaller row groups (4MB) for DataFusion
.config("parquet.page.size", 128 * 1024)
.config("parquet.writer.version", "PARQUET_2_0")
.config("parquet.int96RebaseModeInWrite", "CORRECTED")
.config("spark.sql.parquet.mergeSchema", "false")
.config("parquet.column.index.enabled", "true")
.config("parquet.column.index.pageSize", 64 * 1024)
.config("parquet.statistics.enabled", "true")
.config("parquet.int64.timestats", "false")
.config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
.config("spark.sql.parquet.filterPushdown", "true")
```