I am currently comparing different DataFrame-based libraries in Python and Rust. Of course I am also looking at Polars (pola.rs), as that library can be used from within both languages.
I tried to write the same code in both languages, as far as I understood and was able to. At first, with the initial test data (CSV files with 2k and 1k lines), the performance of both versions is roughly the same.
Then I increased the data by copying the same content into the same files an additional 9 times, creating files with 20k and 10k lines. The Python version is only slightly slower with the bigger files; the Rust version takes 8 times as long.
I don't understand why. Hopefully someone can point out where I am using the Rust library incorrectly and thereby tanking its performance.
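(For anyone reproducing the setup: a small helper along these lines would produce the same enlarged files. This is only an illustrative sketch, not the exact method I used, and the function name is made up.)

use std::fs;

// Hypothetical helper: append the data rows of a CSV to itself `times` more times,
// keeping the single header line intact.
fn grow_csv(path: &str, times: usize) -> std::io::Result<()> {
    let content = fs::read_to_string(path)?;
    let mut lines = content.lines();
    let header = lines.next().unwrap_or("").to_string();
    let body: Vec<String> = lines.map(|l| l.to_string()).collect();
    let mut out = vec![header];
    for _ in 0..=times {
        out.extend(body.iter().cloned());
    }
    fs::write(path, out.join("\n") + "\n")
}

For example, grow_csv("data/air_quality_no2_long.csv", 9) turns the 2k-line file into the 20k-line one.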
First, for comparison, the Python code:
import polars as pl

LOCATION = "London Westminster"

def main():
    no2_quality = pl.read_csv("data/air_quality_no2_long.csv", try_parse_dates=True)
    no2_quality = no2_quality.filter(pl.col("location") == LOCATION)
    no2_grouped = no2_quality.group_by(
        (pl.col("date.utc").dt.date()).alias("date"),
        (pl.col("date.utc").dt.hour()).alias("hour"),
        maintain_order=True,
    ).sum()

    pm25_quality = pl.read_csv("data/air_quality_pm25_long.csv", try_parse_dates=True)
    pm25_quality = pm25_quality.filter(pl.col("location") == LOCATION)
    pm25_grouped = pm25_quality.group_by(
        (pl.col("date.utc").dt.date()).alias("date"),
        (pl.col("date.utc").dt.hour()).alias("hour"),
        maintain_order=True,
    ).sum().sort(["date", "hour"])

    by_day = no2_grouped.group_by(
        (pl.col("date")),
    ).sum()
    by_day = by_day.sort("value", descending=True)

    top = by_day.head(3).select(
        pl.col("date"),
        pl.col("value"),
    )
    print(top)

    top_hourly = (top.join(no2_grouped, on="date", how="left", suffix="_no2")
                  .join(pm25_grouped, on=["date", "hour"], how="left", suffix="_pm25"))
    top_hourly = top_hourly.group_by(
        pl.col("hour"),
    ).mean().sort("hour")
    print(top_hourly.select(
        pl.col("hour"),
        pl.col("value_no2"),
        pl.col("value_pm25"),
    ))

    bottom = by_day.tail(3).select(
        pl.col("date"),
        pl.col("value"),
    )
    print(bottom)

    bottom_hourly = (bottom.join(no2_grouped, on="date", how="left", suffix="_no2")
                     .join(pm25_grouped, on=["date", "hour"], how="left", suffix="_pm25")).sort(["date", "hour"])
    bottom_hourly = bottom_hourly.group_by(
        pl.col("hour"),
    ).mean().sort("hour")
    print(bottom_hourly.select(
        pl.col("hour"),
        pl.col("value_no2"),
        pl.col("value_pm25"),
    ))

if __name__ == "__main__":
    main()
Now the Rust code:
use polars::prelude::*;

const LOCATION: &str = "London Westminster";

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let no2_quality = CsvReadOptions::default()
        .with_infer_schema_length(None)
        .with_has_header(true)
        .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
        .try_into_reader_with_file_path(Some("data/air_quality_no2_long.csv".into()))?
        .finish()?;
    let no2_quality = no2_quality.lazy().filter(col("location").eq(lit(LOCATION)));
    let no2_grouped = no2_quality
        .clone()
        .group_by_stable([
            col("date.utc").dt().date().alias("date"),
            col("date.utc").dt().hour().alias("hour"),
        ])
        .agg([sum("value")]);

    let pm25_quality = CsvReadOptions::default()
        .with_infer_schema_length(None)
        .with_has_header(true)
        .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
        .try_into_reader_with_file_path(Some("data/air_quality_pm25_long.csv".into()))?
        .finish()?;
    let pm25_quality = pm25_quality
        .lazy()
        .filter(col("location").eq(lit(LOCATION)));
    let pm25_grouped = pm25_quality
        .group_by_stable([
            col("date.utc").dt().date().alias("date"),
            col("date.utc").dt().hour().alias("hour"),
        ])
        .agg([sum("value")]);

    let by_day = no2_quality
        .group_by([col("date.utc").dt().date().alias("date")])
        .agg([sum("value")]);
    let by_day = by_day.sort(
        ["value"],
        SortMultipleOptions::default().with_order_descending(true),
    );

    let top = by_day.clone().limit(3).collect()?;
    println!("{}", top);

    let top_hourly = top.lazy().join(
        no2_grouped.clone(),
        [col("date")],
        [col("date")],
        JoinArgs::default().with_suffix(Some("_no2".into())),
    );
    let top_hourly = top_hourly.join(
        pm25_grouped.clone(),
        [col("date"), col("hour")],
        [col("date"), col("hour")],
        JoinArgs::default().with_suffix(Some("_pm25".into())),
    );
    let top_hourly = top_hourly
        .group_by([col("hour")])
        .agg([mean("value_no2"), mean("value_pm25")])
        .sort(["hour"], SortMultipleOptions::default())
        .collect()?;
    println!("{}", top_hourly);

    let bottom = by_day.tail(3).collect()?;
    println!("{}", bottom);

    let bottom_hourly = bottom.lazy().join(
        no2_grouped.clone(),
        [col("date")],
        [col("date")],
        JoinArgs::default().with_suffix(Some("_no2".into())),
    );
    let bottom_hourly = bottom_hourly.join(
        pm25_grouped.clone(),
        [col("date"), col("hour")],
        [col("date"), col("hour")],
        JoinArgs::default().with_suffix(Some("_pm25".into())),
    );
    let bottom_hourly = bottom_hourly
        .group_by([col("hour")])
        .agg([mean("value_no2"), mean("value_pm25")])
        .sort(["hour"], SortMultipleOptions::default())
        .collect()?;
    println!("{}", bottom_hourly);

    Ok(())
}
The input files are from the pandas examples:
- air_quality_no2_long.csv
- air_quality_pm25_long.csv
It is not simply a missing-optimizations problem, as I run a release build (cargo build --release) for both file sizes.
To compare the Python version with the Rust version, and the small files with the bigger ones, I use the shell-integrated tool time. Everything runs on the same hardware (MacBook M3 Pro) and the same OS.
2k / 1k:
================
CPU 131%
user 0.116
system 0.032
total 0.113
20k / 10k:
================
CPU 115%
user 0.934
system 0.038
total 0.842
Also, a different DataFrame library in Rust does not suffer such a performance penalty with the bigger files, so I don't think it is a language issue per se.
Updates based on comments/feedback
collect calls
First I reduced the number of collect calls to only 4, right before the individual println calls (the code above has been updated to reflect that change).
Reducing the intermediate variables does not have any additional effect.
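For clarity, the pattern after that change is roughly this (a minimal sketch using the names from the code above):

// everything up to here stays lazy, nothing is computed yet
let top = by_day.clone().limit(3);
// the only collect() happens right before printing,
// so the optimizer sees the whole query plan at once
let top = top.collect()?;
println!("{}", top);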
================
CPU 116%
user 0.967
system 0.046
total 0.870
lto=fat
I added lto = "fat" to the Cargo.toml:
[profile.release]
lto = "fat"
Effect:
================
CPU 115%
user 1.031
system 0.043
total 0.925
Consistently slightly higher than without it. Do I need to combine that with some other optimization?
lto=fat & codegen-units=1
[profile.release]
codegen-units = 1
debug = false
lto = "fat"
Effect:
================
CPU 116%
user 0.931
system 0.047
total 0.842
jemalloc
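The allocator switch itself is the standard global-allocator hook (the same setup as in the test version further down), with the jemallocator crate added as a dependency:

use jemallocator::Jemalloc;

// Route all heap allocations through jemalloc instead of the system allocator.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;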
without lto=fat:
================
CPU 116%
user 0.983
system 0.049
total 0.889
with lto=fat:
================
CPU 116%
user 1.033
system 0.055
total 0.935
nightly build
Using the nightly toolchain installed via rustup and
[dependencies]
polars = { version = "0.46.0", features = ["lazy", "nightly"] }
Still had no major impact:
================
CPU 116%
user 0.925
system 0.046
total 0.836
Wrong Direction?
As none of the hints from the comments brought any measurable improvement, I am beginning to suspect that the problem is not in processing the data but in reading it.
So I made the following test version:
use jemallocator::Jemalloc;
use polars::prelude::*;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

const LOCATION: &str = "London Westminster";

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let _no2_quality = CsvReadOptions::default()
        .with_infer_schema_length(None)
        .with_has_header(true)
        .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
        .try_into_reader_with_file_path(Some("data/air_quality_no2_long.csv".into()))?
        .finish()?;
    let _pm25_quality = CsvReadOptions::default()
        .with_infer_schema_length(None)
        .with_has_header(true)
        .with_parse_options(CsvParseOptions::default().with_try_parse_dates(true))
        .try_into_reader_with_file_path(Some("data/air_quality_pm25_long.csv".into()))?
        .finish()?;
    Ok(())
}
I built it in release mode without any extra features and let it run:
================
CPU 111%
user 0.925
system 0.016
total 0.842
So I currently tend to see this as a bug in the CSV reading.
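One way to narrow this down further (an untested sketch; limiting inference to 100 rows is an arbitrary choice I have not benchmarked) would be to read with a bounded schema-inference window and without automatic date parsing, to see which part of the read dominates:

let _df = CsvReadOptions::default()
    .with_infer_schema_length(Some(100))            // infer the schema from the first 100 rows only
    .with_has_header(true)
    .with_parse_options(CsvParseOptions::default()) // no with_try_parse_dates(true)
    .try_into_reader_with_file_path(Some("data/air_quality_no2_long.csv".into()))?
    .finish()?;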
Thanks for all your responses, and sorry for not having seen this earlier.