
python - Values differ on multiple reads from parquet files using polars read_parquet but not with pandas read_parquet by workst


I read data from the same Parquet files multiple times using polars (with both the polars Rust engine and pyarrow) and using pandas with the pyarrow backend (not fastparquet, as it was very slow); see the code below.

All the Parquet files contain a column called "backscat", in which every element is a list of 1.5 million floating-point values (it is a time series).

If the hash of the dataframe differs between reads, my script compares the values of the "backscat" column and saves any values that differ by more than "rtol = 1e-7, atol = 1e-10" as "violations".
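A minimal sketch of how such a check could look, assuming the "backscat" values of each read have already been extracted into NumPy arrays. The function names (`array_digest`, `find_violations`, `compare_reads`) and the choice of SHA-256 are illustrative assumptions, not the script's actual code:

```python
import hashlib

import numpy as np

RTOL = 1e-7   # relative tolerance from the description above
ATOL = 1e-10  # absolute tolerance from the description above


def array_digest(values) -> str:
    """Hash the raw bytes of an array (hypothetical helper, SHA-256 assumed)."""
    data = np.ascontiguousarray(values, dtype=np.float64)
    return hashlib.sha256(data.tobytes()).hexdigest()


def find_violations(read_1, read_2) -> np.ndarray:
    """Return the indices where two reads differ beyond the tolerances."""
    a = np.asarray(read_1, dtype=np.float64)
    b = np.asarray(read_2, dtype=np.float64)
    # np.isclose tests |a - b| <= ATOL + RTOL * |b| element-wise.
    within = np.isclose(a, b, rtol=RTOL, atol=ATOL, equal_nan=True)
    return np.flatnonzero(~within)


def compare_reads(read_1, read_2) -> np.ndarray:
    """Only run the element-wise check when the hashes disagree."""
    if array_digest(read_1) == array_digest(read_2):
        return np.array([], dtype=np.intp)
    return find_violations(read_1, read_2)
```

Two reads whose hashes differ can still produce no violations if every element lies within the tolerances, so the hash acts only as a cheap first filter.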

I ran the script on two workstations with identical CPUs, the same virtual environment (poetry), Python version (3.12.8), polars version (1.20.0), and OS (Ubuntu 22.04). They also have exactly the same memory configuration.

On workstation A, my analysis shows significant differences between repeated reads of the same file. For example, sorted by largest absolute difference, the five largest differences for the polars Rust backend are:

name_test    data_1      data_2      abs_diff   abs_relative_diff
polars_rust  -3.041666   -38.666656  35.62499   11.712328
polars_rust  -38.666656  -3.041666   35.62499   0.921336
polars_rust  -2.927914   -27.423315  24.495401  8.36616
polars_rust  -27.423315  -2.927914   24.495401  0.893233
polars_rust  -2.927876   -27.42301   24.495134  8.366178
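The numbers in the table are consistent with abs_diff = |data_1 - data_2| and abs_relative_diff = abs_diff / |data_1|. A small sketch of how such a ranking could be produced with NumPy follows. The function name `largest_differences` and the dictionary layout are assumptions for illustration, not the original script:

```python
import numpy as np


def largest_differences(data_1, data_2, top_n=5):
    """Rank element-wise differences between two reads, largest first."""
    a = np.asarray(data_1, dtype=np.float64)
    b = np.asarray(data_2, dtype=np.float64)
    abs_diff = np.abs(a - b)
    # Relative to data_1, matching the table; a zero in data_1 gives inf or nan.
    with np.errstate(divide="ignore", invalid="ignore"):
        abs_relative_diff = abs_diff / np.abs(a)
    # Indices of the top_n largest absolute differences, in descending order.
    order = np.argsort(abs_diff)[::-1][:top_n]
    return [
        {
            "index": int(i),
            "data_1": float(a[i]),
            "data_2": float(b[i]),
            "abs_diff": float(abs_diff[i]),
            "abs_relative_diff": float(abs_relative_diff[i]),
        }
        for i in order
    ]
```

Keeping the element index alongside each row makes it possible to check whether the mismatching values cluster in particular regions of the time series.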
