
Polars out of core sorting and memory usage


From what I understand this is a main use case for Polars: being able to process a dataset that is larger than RAM, using disk space if necessary. Yet I am unable to achieve this in a Kubernetes environment. To replicate locally I tried launching a docker container with a low memory limit:

docker run -it --memory=500m --rm  -v `pwd`:/app python:3.12  /bin/bash
# pip install polars==1.26.0
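
A minimal sketch of such a check (assuming the standard cgroup v2 path, with a cgroup v1 fallback; with --memory=500m the expected value is 524288000 bytes) could look like this:

from pathlib import Path

def cgroup_memory_limit() -> str:
    # cgroup v2 exposes the limit in memory.max; cgroup v1 in memory.limit_in_bytes
    candidates = (
        Path("/sys/fs/cgroup/memory.max"),
        Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),
    )
    for path in candidates:
        if path.exists():
            return path.read_text().strip()
    return "no cgroup memory limit file found"

print(cgroup_memory_limit())  # expect 524288000 for --memory=500m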

I checked (roughly as in the sketch above) that the memory limit was set up in cgroups for the current process. Then I ran a script that loads a moderately large dataframe (a 23 MB parquet file, 158 MB uncompressed) using scan_parquet, performs a sort, and outputs the head:

source = "parquet/central_west.df"
df = pl.scan_parquet(source, low_memory=True)
query = df.sort("station_code").head()
print(query.collect(engine="streaming"))

This leads to the process getting killed. It works with a smaller dataframe, or with a larger limit. Is Polars not reading the limit correctly, or is it unable to work with a limit that low? I understand the "new" streaming engine is still in beta, so I tried the same script with Polars 1.22.0, but the result was the same. This seems like a very simple and common use case, so I hope I am just missing a configuration trick.
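
One rough way to see how close the query gets to the limit before the kill would be to poll the cgroup usage file from a background thread while the query runs; the last value printed before the OOM kill gives an approximate peak. A sketch (assuming the cgroup v2 memory.current file, with a v1 fallback):

import threading
import time
from pathlib import Path

import polars as pl

def watch_memory(interval: float = 0.5) -> None:
    # cgroup v2 reports current usage in memory.current; cgroup v1 in memory.usage_in_bytes
    usage = Path("/sys/fs/cgroup/memory.current")
    if not usage.exists():
        usage = Path("/sys/fs/cgroup/memory/memory.usage_in_bytes")
    while True:
        mib = int(usage.read_text()) / 2**20
        print(f"cgroup memory usage: {mib:.0f} MiB", flush=True)
        time.sleep(interval)

# Daemon thread keeps printing until the process exits (or is killed)
threading.Thread(target=watch_memory, daemon=True).start()

query = pl.scan_parquet("parquet/central_west.df", low_memory=True).sort("station_code").head()
print(query.collect(engine="streaming"))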

On a hunch, and based on a similar question, I tried setting POLARS_IDEAL_MORSEL_SIZE=100 (roughly as sketched below), but that made no difference, and I feel like I am grasping at straws here.
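
For completeness, a sketch of how that variable could be applied from Python; setting it before the polars import is an assumption made to be safe, since it is read from the process environment:

import os

# Variable name taken from the paragraph above; set before importing polars
# so it is already in the environment when the engine starts.
os.environ["POLARS_IDEAL_MORSEL_SIZE"] = "100"

import polars as pl

query = pl.scan_parquet("parquet/central_west.df", low_memory=True).sort("station_code").head()
print(query.collect(engine="streaming"))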
