I am writing Parquet files using two different frameworks—Apache Spark (Scala) and Polars (Python)—with the same schema and data. However, when I query the resulting Parquet files using Apache DataFusion, I notice a significant performance difference:
- Queries run faster on the Parquet file written by Polars.
- Queries take longer on the Parquet file written by Spark.

I expected similar performance since the schema and data are the same, and I am trying to understand why this discrepancy occurs.
Here are some details about my setup:
- Spark version: 3.5.0
- Polars version: 1.24.0
Parquet write options:
- Spark: `df.write.parquet("path")`
- Polars: `df.write_parquet("path")`
I also tried changing the compression codec on the Spark side, but the resulting file still did not query as fast as the one written by Polars.
Has anyone experienced a similar issue? What aspects of Spark's and Polars' Parquet writing might cause this performance difference? Are there specific configurations I should check when writing Parquet in either framework?
These are some of the Spark configs I also tried adjusting before writing:
```scala
.config("spark.sql.parquet.compression.codec", "zstd")
.config("parquet.enable.dictionary", "true")
.config("parquet.dictionary.pageSize", 1048576)
.config("parquet.block.size", 4 * 1024 * 1024) // Smaller row groups (4MB) for DataFusion
.config("parquet.page.size", 128 * 1024)
.config("parquet.writer.version", "PARQUET_2_0")
.config("parquet.int96RebaseModeInWrite", "CORRECTED")
.config("spark.sql.parquet.mergeSchema", "false")
.config("parquet.column.index.enabled", "true")
.config("parquet.column.index.pageSize", 64 * 1024)
.config("parquet.statistics.enabled", "true")
.config("parquet.int64.timestats", "false")
.config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
.config("spark.sql.parquet.filterPushdown", "true")
```