I read data from the same parquet files multiple times using polars (with both the polars Rust engine and pyarrow) and using pandas with the pyarrow backend (not fastparquet, as it was very slow); see the code below.
All the parquet files contain a column called "backscat", where every element is a list of 1.5 million floating-point values (it is a time series).
If the hash of the dataframe differs between reads, my script compares the values in the "backscat" column and saves any values that fall outside "rtol = 1e-7, atol = 1e-10" as "violations".
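To make the check concrete, here is a minimal sketch of that comparison step. It is not my exact script: the helper names `array_digest` and `find_violations` are made up for illustration, and it assumes the tolerance test follows `numpy.isclose` semantics, i.e. `|a - b| <= atol + rtol * |b|`.

```python
import hashlib

import numpy as np


def array_digest(values: np.ndarray) -> str:
    """Hash the raw bytes of an array (hypothetical stand-in for the dataframe hash)."""
    return hashlib.sha256(np.ascontiguousarray(values).tobytes()).hexdigest()


def find_violations(data_1, data_2, rtol=1e-7, atol=1e-10) -> np.ndarray:
    """Return the indices where two reads of the same series disagree beyond tolerance."""
    data_1 = np.asarray(data_1, dtype=np.float64)
    data_2 = np.asarray(data_2, dtype=np.float64)

    # Cheap path: byte-identical reads cannot contain violations.
    if array_digest(data_1) == array_digest(data_2):
        return np.array([], dtype=np.int64)

    # Assumption: tolerance check as in numpy.isclose, with NaN == NaN treated as equal.
    within_tolerance = np.isclose(data_1, data_2, rtol=rtol, atol=atol, equal_nan=True)
    return np.flatnonzero(~within_tolerance)


first_read = [1.0, 2.0, -3.041666]
second_read = [1.0, 2.0, -38.666656]
violations = find_violations(first_read, second_read)
print(violations)
```

Only element-wise mismatches are reported here; both reads are assumed to have the same length.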
I ran the script on two workstations with identical CPUs, the identical virtual environment (poetry), the same Python version (3.12.8) and polars version (1.20.0), and Ubuntu 22.04. They also have exactly the same memory configuration.
On workstation A, my analysis shows significant differences between repeated reads of the same file. For example, sorting by the largest absolute difference for the polars Rust backend, the 5 biggest values are:
name_test | data_1 | data_2 | abs_diff | abs_relative_diff |
---|---|---|---|---|
polars_rust | -3.041666 | -38.666656 | 35.62499 | 11.712328 |
polars_rust | -38.666656 | -3.041666 | 35.62499 | 0.921336 |
polars_rust | -2.927914 | -27.423315 | 24.495401 | 8.36616 |
polars_rust | -27.423315 | -2.927914 | 24.495401 | 0.893233 |
polars_rust | -2.927876 | -27.42301 | 24.495134 | 8.366178 |
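
For reference, the `abs_diff` and `abs_relative_diff` columns can be reproduced roughly as follows. This is a sketch rather than my exact code: the function name `diff_summary` is hypothetical, and the relative difference is assumed to be `abs_diff / |data_1|`, which is what the numbers in the table are consistent with.

```python
import numpy as np


def diff_summary(data_1, data_2, top_n=5):
    """Return (data_1, data_2, abs_diff, abs_relative_diff) rows, largest abs_diff first."""
    data_1 = np.asarray(data_1, dtype=np.float64)
    data_2 = np.asarray(data_2, dtype=np.float64)

    abs_diff = np.abs(data_1 - data_2)
    # Assumption: the relative difference is taken with respect to data_1.
    with np.errstate(divide="ignore", invalid="ignore"):
        abs_relative_diff = abs_diff / np.abs(data_1)

    # Stable sort so that ties keep their original order.
    order = np.argsort(-abs_diff, kind="stable")[:top_n]
    return [
        (float(data_1[i]), float(data_2[i]), float(abs_diff[i]), float(abs_relative_diff[i]))
        for i in order
    ]


rows = diff_summary([-3.041666, -38.666656], [-38.666656, -3.041666])
for row in rows:
    print(row)
```

Because the relative difference is normalized by `data_1`, the same pair of values gives two different relative differences depending on which read is treated as the reference, which is why each pair appears twice in the table.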