I am currently comparing different DataFrame-based libraries in Python and Rust. Of course I am also looking at Polars (pola.rs), as that library can be used from within both languages.
I tried to write the same code in both languages, as far as I understood and was able to. At first, with the initial test data (CSV files with 2k and 1k lines), the performance of both versions is roughly the same.
Then I increased the data by copying the same content into the same files an additional 9 times, creating files with 20k and 10k lines. The Python version is only slightly slower with the bigger files; the Rust version takes 8 times as long.
I don't understand why. Hopefully someone can point out where I am using the Rust library incorrectly and thereby tanking its performance.
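(For anyone reproducing the setup: a small helper along these lines would produce the same enlarged files. This is only an illustrative sketch, not the exact method I used, and the function name is made up.)

use std::fs;

// Hypothetical helper: append the data rows of a CSV to itself `times` more times,
// keeping the single header line intact.
fn grow_csv(path: &str, times: usize) -> std::io::Result<()> {
    let content = fs::read_to_string(path)?;
    let mut lines = content.lines();
    let header = lines.next().unwrap_or("").to_string();
    let body: Vec<String> = lines.map(|l| l.to_string()).collect();
    let mut out = vec![header];
    for _ in 0..=times {
        out.extend(body.iter().cloned());
    }
    fs::write(path, out.join("\n") + "\n")
}

For example, grow_csv("data/air_quality_no2_long.csv", 9) turns the 2k-line file into the 20k-line one.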
First, for comparison, the Python code:
import polars as pl

LOCATION = "London Westminster"

def main():
    no2_quality = pl.read_csv("data/air_quality_no2_long.csv", try_parse_dates=True)
    no2_quality = no2_quality.filter(pl.col("location") == LOCATION)
    no2_grouped = no2_quality.group_by(
        (pl.col("date.utc").dt.date()).alias("date"),
        (pl.col("date.utc").dt.hour()).alias("hour"),
        maintain_order=True,
    ).sum()

    pm25_quality = pl.read_csv("data/air_quality_pm25_long.csv", try_parse_dates=True)
    pm25_quality = pm25_quality.filter(pl.col("location") == LOCATION)
    pm25_grouped = pm25_quality.group_by(
        (pl.col("date.utc").dt.date()).alias("date"),
        (pl.col("date.utc").dt.hour()).alias("hour"),
        maintain_order=True,
    ).sum().sort(["date", "hour"])

    by_day = no2_grouped.group_by(
        (pl.col("date")),
    ).sum()
    by_day = by_day.sort("value", descending=True)

    top = by_day.head(3).select(
        pl.col("date"),
        pl.col("value"),
    )
    print(top)

    top_hourly = (top.join(no2_grouped, on="date", how="left", suffix="_no2")
                  .join(pm25_grouped, on=["date", "hour"], how="left", suffix="_pm25"))
    top_hourly = top_hourly.group_by(
        pl.col("hour"),
    ).mean().sort("hour")
    print(top_hourly.select(
        pl.col("hour"),
        pl.col("value_no2"),
        pl.col("value_pm25"),
    ))

    bottom = by_day.tail(3).select(
        pl.col("date"),
        pl.col("value"),
    )
    print(bottom)

    bottom_hourly = (bottom.join(no2_grouped, on="date", how="left", suffix="_no2")
                     .join(pm25_grouped, on=["date", "hour"], how="left", suffix="_pm25")).sort(["date", "hour"])
    bottom_hourly = bottom_hourly.group_by(
        pl.col("hour"),
    ).mean().sort("hour")
    print(bottom_hourly.select(
        pl.col("hour"),
        pl.col("value_no2"),
        pl.col("value_pm25"),
    ))

if __name__ == "__main__":
    main()
Now the Rust code:
use polars::prelude::*;

const LOCATION: &str = "London Westminster";

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let no2_quality = CsvReadOptions::default()
        .with_infer_schema_length(None)
        .with_has_header(true)
        .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
        .try_into_reader_with_file_path(Some("data/air_quality_no2_long.csv".into()))?
        .finish()?;
    let no2_quality = no2_quality.lazy().filter(col("location").eq(lit(LOCATION)));
    let no2_grouped = no2_quality
        .clone()
        .group_by_stable([
            col("date.utc").dt().date().alias("date"),
            col("date.utc").dt().hour().alias("hour"),
        ])
        .agg([sum("value")]);

    let pm25_quality = CsvReadOptions::default()
        .with_infer_schema_length(None)
        .with_has_header(true)
        .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
        .try_into_reader_with_file_path(Some("data/air_quality_pm25_long.csv".into()))?
        .finish()?;
    let pm25_quality = pm25_quality
        .lazy()
        .filter(col("location").eq(lit(LOCATION)));
    let pm25_grouped = pm25_quality
        .group_by_stable([
            col("date.utc").dt().date().alias("date"),
            col("date.utc").dt().hour().alias("hour"),
        ])
        .agg([sum("value")]);

    let by_day = no2_quality
        .group_by([col("date.utc").dt().date().alias("date")])
        .agg([sum("value")]);
    let by_day = by_day.sort(
        ["value"],
        SortMultipleOptions::default().with_order_descending(true),
    );

    let top = by_day.clone().limit(3).collect()?;
    println!("{}", top);

    let top_hourly = top.lazy().join(
        no2_grouped.clone(),
        [col("date")],
        [col("date")],
        JoinArgs::default().with_suffix(Some("_no2".into())),
    );
    let top_hourly = top_hourly.join(
        pm25_grouped.clone(),
        [col("date"), col("hour")],
        [col("date"), col("hour")],
        JoinArgs::default().with_suffix(Some("_pm25".into())),
    );
    let top_hourly = top_hourly
        .group_by([col("hour")])
        .agg([mean("value_no2"), mean("value_pm25")])
        .sort(["hour"], SortMultipleOptions::default())
        .collect()?;
    println!("{}", top_hourly);

    let bottom = by_day.tail(3).collect()?;
    println!("{}", bottom);

    let bottom_hourly = bottom.lazy().join(
        no2_grouped.clone(),
        [col("date")],
        [col("date")],
        JoinArgs::default().with_suffix(Some("_no2".into())),
    );
    let bottom_hourly = bottom_hourly.join(
        pm25_grouped.clone(),
        [col("date"), col("hour")],
        [col("date"), col("hour")],
        JoinArgs::default().with_suffix(Some("_pm25".into())),
    );
    let bottom_hourly = bottom_hourly
        .group_by([col("hour")])
        .agg([mean("value_no2"), mean("value_pm25")])
        .sort(["hour"], SortMultipleOptions::default())
        .collect()?;
    println!("{}", bottom_hourly);

    Ok(())
}
The input files are from the pandas examples:
- air_quality_no2_long.csv
- air_quality_pm25_long.csv
It is not simply a missing-optimizations problem, as I run a release build (cargo build --release) for both file sizes.
To compare the Python version with the Rust version, and the small files with the bigger ones, I use the shell-integrated tool time. Everything runs on the same hardware (MacBook M3 Pro) and the same OS.
2k / 1k:
================
CPU 131%
user 0.116
system 0.032
total 0.113
20k / 10k:
================
CPU 115%
user 0.934
system 0.038
total 0.842
Also, a different DataFrame library in Rust does not suffer such a performance penalty with the bigger files, so I don't think it is a language issue per se.
Updates based on comments/feedback
collect calls
First I reduced the number of collect calls to only 4, right before the individual println calls (the code above has been updated to reflect that change).
Reducing the intermediate variables does not have any additional effect.
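For clarity, the pattern after that change is roughly this (a minimal sketch using the names from the code above):

// everything up to here stays lazy, nothing is computed yet
let top = by_day.clone().limit(3);
// the only collect() happens right before printing,
// so the optimizer sees the whole query plan at once
let top = top.collect()?;
println!("{}", top);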
================
CPU 116%
user 0.967
system 0.046
total 0.870
lto=fat
I added lto = "fat" to the Cargo.toml:
[profile.release]
lto = "fat"
Effect:
================
CPU 115%
user 1.031
system 0.043
total 0.925
Consistently slightly higher than without it. Do I need to combine that with some other optimization?
lto=fat & codegen-units=1
[profile.release]
codegen-units = 1
debug = false
lto = "fat"
Effect:
================
CPU 116%
user 0.931
system 0.047
total 0.842
jemalloc
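The allocator switch itself is the standard global-allocator hook (the same setup as in the test version further down), with the jemallocator crate added as a dependency:

use jemallocator::Jemalloc;

// Route all heap allocations through jemalloc instead of the system allocator.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;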
without lto=fat:
================
CPU 116%
user 0.983
system 0.049
total 0.889
with lto=fat:
================
CPU 116%
user 1.033
system 0.055
total 0.935
nightly build
Using the nightly toolchain installed via rustup and
[dependencies]
polars = { version = "0.46.0", features = ["lazy", "nightly"] }
Still had no major impact:
================
CPU 116%
user 0.925
system 0.046
total 0.836
Wrong Direction?
As none of the hints from the comments brought any measurable improvement, I am beginning to suspect that the problem is not in processing the data but in reading it.
So I made the following test version:
use jemallocator::Jemalloc;
use polars::prelude::*;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

const LOCATION: &str = "London Westminster";

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let _no2_quality = CsvReadOptions::default()
        .with_infer_schema_length(None)
        .with_has_header(true)
        .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
        .try_into_reader_with_file_path(Some("data/air_quality_no2_long.csv".into()))?
        .finish()?;
    let _pm25_quality = CsvReadOptions::default()
        .with_infer_schema_length(None)
        .with_has_header(true)
        .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
        .try_into_reader_with_file_path(Some("data/air_quality_pm25_long.csv".into()))?
        .finish()?;
    Ok(())
}
I built it in release mode without any extra features and let it run:
================
CPU 111%
user 0.925
system 0.016
total 0.842
So I currently tend to see this as a bug in the CSV reading.
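One way to narrow this down further (an untested sketch; limiting inference to 100 rows is an arbitrary choice I have not benchmarked) would be to read with a bounded schema-inference window and without automatic date parsing, to see which part of the read dominates:

let _df = CsvReadOptions::default()
    .with_infer_schema_length(Some(100))            // infer the schema from the first 100 rows only
    .with_has_header(true)
    .with_parse_options(CsvParseOptions::default()) // no with_try_parse_dates(true)
    .try_into_reader_with_file_path(Some("data/air_quality_no2_long.csv".into()))?
    .finish()?;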
Thanks for all your responses, and sorry for not having seen this earlier.