From what I understand, this is one of the main use cases for Polars: processing a dataset larger than RAM, spilling to disk if necessary. Yet I am unable to achieve this in a Kubernetes environment. To reproduce the problem locally, I launched a docker container with a low memory limit:
docker run -it --memory=500m --rm -v `pwd`:/app python:3.12 /bin/bash
# pip install polars==1.26.0
I checked that this set the memory limit in cgroups for the container.
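For reference, this is how I verified the limit from inside the container (the path differs between cgroup v1 and v2):

import pathlib
# cgroup v2 exposes the limit in memory.max; cgroup v1 in
# memory/memory.limit_in_bytes. --memory=500m should show 524288000.
v2 = pathlib.Path("/sys/fs/cgroup/memory.max")
v1 = pathlib.Path("/sys/fs/cgroup/memory/memory.limit_in_bytes")
path = v2 if v2.exists() else v1
print(path, "=>", path.read_text().strip())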
Then I ran a script that loads a moderately large dataframe (23M Parquet file, 158M uncompressed) using scan_parquet, performs a sort, and outputs the head:
import polars as pl

source = "parquet/central_west.df"
df = pl.scan_parquet(source, low_memory=True)  # lazy scan, no full load
query = df.sort("station_code").head()
print(query.collect(engine="streaming"))
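I can't share the data itself, but a stand-in file of roughly the same uncompressed size can be generated like this (run it outside the memory-limited container; station_code is the only real column name from my schema, the rest is made up):

import numpy as np
import polars as pl
# ~10M rows x 16 bytes of numeric data lands near the 158M
# uncompressed size of the real file.
n = 10_000_000
rng = np.random.default_rng(0)
df = pl.DataFrame({
    "station_code": rng.integers(0, 600, n),  # fake sort key
    "value": rng.random(n),
}).with_columns(pl.col("station_code").cast(pl.String))
df.write_parquet("parquet/central_west.df")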
Running this script leads to the process being OOM-killed. It works with a smaller dataframe or a larger limit. Is Polars not reading the limit correctly, or is it unable to work with a limit this low? I understand the "new" streaming engine is still in beta, so I tried the same script with Polars 1.22.0, but the result was the same. This seems like a very simple and common use case, so I hope I am just missing a configuration trick.
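In case the output helps with diagnosis, the query can be re-run with verbose logging to see what the engine reports right up to the kill; as far as I can tell, pl.Config.set_verbose(True) is the in-process equivalent of setting POLARS_VERBOSE=1:

import polars as pl
pl.Config.set_verbose(True)  # same effect as POLARS_VERBOSE=1, I believe
lf = pl.scan_parquet("parquet/central_west.df", low_memory=True)
print(lf.sort("station_code").head().collect(engine="streaming"))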
On a hunch, and based on a similar question, I tried setting POLARS_IDEAL_MORSEL_SIZE=100, but that made no difference; I feel like I am grasping at straws here.
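For completeness, this is how I applied it (I exported the variable in the shell before launching Python; the in-process equivalent would be the following, with the value 100 taken from that other question rather than from any documentation):

import os
# Set before importing polars so the engine is sure to see it.
os.environ["POLARS_IDEAL_MORSEL_SIZE"] = "100"
import polars as pl
# ...then run the same scan/sort/head query as above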