python - Values differ on multiple reads from parquet files using polars read_parquet but not with pandas read

I read data from the same parquet files multiple times using polars (with both the polars Rust engine and pyarrow) and using pandas with the pyarrow backend (not fastparquet, as it was very slow); see the code below.

All the parquet files contain a column called "backscat" in which every element is a list of 1.5 million floating-point values (it is a time series).

If the hash of the dataframe differs between two reads, my script compares the values of the "backscat" column and, if any values fall outside "rtol = 1e-7, atol = 1e-10", saves them as "violations".
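The check described above can be sketched roughly as follows. This is a minimal illustration, not the script from the question: `digest` and `find_violations` are hypothetical names, hashing the raw float64 bytes with SHA-256 is only a stand-in for whatever dataframe hash was actually used, and the tolerance formula `|a - b| <= atol + rtol * |b|` is assumed to follow the convention of `numpy.isclose`. Plain Python is used to keep the sketch dependency-free; for 1.5 million values per row a vectorised version would be far faster.

```python
import hashlib
from array import array

RTOL = 1e-7   # relative tolerance quoted in the question
ATOL = 1e-10  # absolute tolerance quoted in the question


def digest(values):
    """Hash the raw float64 bytes of one series (assumed stand-in for the dataframe hash)."""
    return hashlib.sha256(array("d", values).tobytes()).hexdigest()


def find_violations(data_1, data_2, rtol=RTOL, atol=ATOL):
    """Return (index, value_1, value_2) for every pair outside the tolerance."""
    if len(data_1) != len(data_2):
        raise ValueError("series must have the same length")
    if digest(data_1) == digest(data_2):
        return []  # byte-identical reads, nothing to compare
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(data_1, data_2))
        # Assumed convention: same asymmetric formula as numpy.isclose.
        if abs(a - b) > atol + rtol * abs(b)
    ]


# Two made-up "reads" of the same series; the middle value differs.
read_1 = [1.0, -3.041666, 2.5]
read_2 = [1.0, -38.666656, 2.5]
violations = find_violations(read_1, read_2)
print(violations)
```

Because the hash is compared first, the element-wise pass only runs when two reads are not byte-identical, which matches the order of checks described above.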

I ran that script on two workstations with identical CPUs, the identical virtual environment (poetry), the same Python version (3.12.8), polars version 1.20.0, and Ubuntu 22.04. They also have exactly the same memory configuration.

On workstation A, my analysis shows significant differences between repeated reads of the same file. For example, sorting by the largest absolute difference, the 5 biggest values for the polars Rust backend are:

name_test	data_1	data_2	abs_diff	abs_relative_diff
polars_rust	-3.041666	-38.666656	35.62499	11.712328
polars_rust	-38.666656	-3.041666	35.62499	0.921336
polars_rust	-2.927914	-27.423315	24.495401	8.36616
polars_rust	-27.423315	-2.927914	24.495401	0.893233
polars_rust	-2.927876	-27.42301	24.495134	8.366178
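For reference, the two difference columns in the table can be reproduced with a few lines of plain Python. This is a sketch under an assumption: the question does not state how `abs_relative_diff` is defined, but the values shown are consistent with dividing `abs_diff` by `|data_1|`, which would also explain why each pair appears twice with a different relative difference. `diff_columns` is a hypothetical helper name.

```python
def diff_columns(data_1, data_2):
    """Return (abs_diff, abs_relative_diff) for one pair of values."""
    abs_diff = abs(data_1 - data_2)
    # Assumed definition, inferred from the table: relative to |data_1|.
    if data_1 == 0:
        return abs_diff, float("inf")
    return abs_diff, abs_diff / abs(data_1)


# The first two pairs from the table above.
pairs = [(-3.041666, -38.666656), (-38.666656, -3.041666)]
results = [diff_columns(d1, d2) for d1, d2 in pairs]

for (d1, d2), (abs_diff, rel_diff) in zip(pairs, results):
    print(f"{d1}\t{d2}\t{abs_diff:.5f}\t{rel_diff:.6f}")
```

Swapping the two values leaves `abs_diff` unchanged but changes the denominator, so the same discrepancy shows up once with a relative difference above 1 and once below 1.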
