Why does a Parquet file written with Polars query faster than one written with Spark? - Stack Overflow

I am writing Parquet files using two different frameworks—Apache Spark (Scala) and Polars (Python)—with the same schema and data. However, when I query the resulting Parquet files using Apache DataFusion, I notice a significant performance difference:

- Queries run faster on the Parquet file written by Polars.
- Queries take longer on the Parquet file written by Spark.

I expected similar performance since the schema and data remain unchanged. I am trying to understand why this discrepancy occurs.

Here are some details about my setup:

- Spark version: 3.5.0
- Polars version: 1.24.0

Parquet write options:

- Spark: df.write.parquet("path")
- Polars: df.write_parquet("path")

I also tried changing the compression codec for Spark, but queries on its output were still slower than on the Parquet file written by Polars.

Has anyone experienced a similar issue? What aspects of Spark's and Polars' Parquet writing might cause this performance difference? Are there specific configurations I should check when writing Parquet in either framework?

These are some of the configs I tried adjusting for Spark before writing:

  .config("spark.sql.parquet.compression.codec", "zstd")  
  .config("parquet.enable.dictionary", "true")  
  .config("parquet.dictionary.pageSize", 1048576) 
  .config("parquet.block.size", 4 * 1024 * 1024)  // Smaller row groups (4MB) for DataFusion 
  .config("parquet.page.size", 128 * 1024)  
  .config("parquet.writer.version", "PARQUET_2_0")  
  .config("parquet.int96RebaseModeInWrite", "CORRECTED")  
  .config("spark.sql.parquet.mergeSchema", "false")  
  .config("parquet.column.index.enabled", "true")  
  .config("parquet.column.index.pageSize", 64 * 1024) 
  .config("parquet.statistics.enabled", "true")  
  .config("parquet.int64.timestats", "false")  
  .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")  
  .config("spark.sql.parquet.filterPushdown", "true") 
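
One way to start comparing the two outputs is to look at their physical layout on disk. With the calls shown above, Spark writes a directory of part files, while Polars writes a single file. The sketch below uses only the Python standard library and relies on one fact about the format: a Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The helper names `footer_length` and `summarize_output` are hypothetical and not part of Spark, Polars, or DataFusion. This does not explain the performance gap by itself. It only shows how many files each writer produced and how large each footer is, which gives a rough idea of how much row-group and column-chunk metadata a reader has to parse.

```python
import os
import struct
from pathlib import Path

PARQUET_MAGIC = b"PAR1"


def footer_length(path: Path) -> int:
    """Return the size in bytes of the footer metadata of one Parquet file.

    Reads only the last 8 bytes: a 4-byte little-endian length
    followed by the 4-byte magic "PAR1".
    """
    # Smallest possible layout: leading magic + length field + trailing magic.
    if path.stat().st_size < 12:
        raise ValueError(f"{path} is too small to be a Parquet file")
    with path.open("rb") as f:
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError(f"{path} does not end with the Parquet magic bytes")
    return struct.unpack("<I", tail[:4])[0]


def summarize_output(path: str) -> list[tuple[str, int, int]]:
    """Return (file name, file size, footer size) for each Parquet file.

    `path` may be a single file (the Polars case) or a directory of
    part files (the Spark case). Only names ending in ".parquet" are
    considered, so markers such as _SUCCESS are skipped.
    """
    root = Path(path)
    files = [root] if root.is_file() else sorted(root.rglob("*.parquet"))
    return [(f.name, f.stat().st_size, footer_length(f)) for f in files]
```

Calling `summarize_output` once on the Spark output directory and once on the Polars file (both paths are placeholders for wherever the data was written) gives two lists that can be compared side by side. Details such as row-group counts, encodings, and per-column statistics live inside the footer and need a real Parquet reader to decode.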