
python - Polars lazy dataframe custom function over rows - Stack Overflow


I am trying to run a custom function on a lazy dataframe on a row-by-row basis. The function itself does not matter, so I'm using softmax as a stand-in; all that matters is that it is not computable via Polars expressions.

I get about this far:

import numpy as np
import polars as pl

def softmax(t):
    # t is a tuple of row values; return a tuple of the same length
    a = np.exp(np.array(t))
    return tuple(a / np.sum(a))

ldf = pl.DataFrame({ 'id': [1,2,3], 'a': [0.2,0.1,0.3], 'b': [0.4,0.1,0.3], 'c': [0.4,0.8,0.4]}).lazy()

cols = ['a', 'b', 'c']
# map_rows names its outputs column_0, column_1, ...; rename them back afterwards
redict = {f'column_{i}': c for i, c in enumerate(cols)}

ldf.select(cols).map_batches(lambda bdf: bdf.map_rows(softmax).rename(redict)).collect()

However, if I want to get a resulting lazy df that contains columns other than cols (such as id), I get stuck, because

ldf.with_columns(pl.col(cols).map_batches(lambda bdf: bdf.map_rows(softmax).rename(redict))).collect()

no longer works, because pl.col(cols).map_batches applies the callback to each selected column separately rather than to the rows...
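A quick probe makes the per-column behavior visible (this is just a demonstration; the callback reports the name of each Series it receives and passes it through unchanged):

def probe(s):
    print(s.name)  # prints: a, b, c -- one call per column, never one per row
    return s

ldf.with_columns(pl.col(cols).map_batches(probe)).collect()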

This does not seem like it would be an uncommon use case, so I'm wondering if I'm missing something.


asked Mar 3 at 14:52 by velochy · edited Mar 3 at 14:53 by jqurious
  • FWIW polars is very resistant to row-by-row operations and the APIs are, in my experience, correspondingly limited – 2e0byo, Mar 3 at 14:55

2 Answers


Polars is pretty averse to row-by-row operations. Generally, if you need that, I'd suggest unpivoting (formerly, "melting") to long format and computing within each id group via .over("id").
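One caveat: in this form map_batches receives the value column of each id group as a pl.Series rather than a tuple of row values, so the question's softmax needs a Series-in/Series-out variant; a minimal sketch:

import numpy as np
import polars as pl

def softmax(s: pl.Series) -> pl.Series:
    # receives the "value" Series of one id group; returns a Series of the same length
    e = np.exp(s.to_numpy())
    return pl.Series(e / e.sum())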

ldf.unpivot(index="id").with_columns(
    pl.col("value").map_batches(softmax).over("id")
).collect()
shape: (9, 3)
┌─────┬──────────┬──────────┐
│ id  ┆ variable ┆ value    │
│ --- ┆ ---      ┆ ---      │
│ i64 ┆ str      ┆ f64      │
╞═════╪══════════╪══════════╡
│ 1   ┆ a        ┆ 0.290461 │
│ 2   ┆ a        ┆ 0.249143 │
│ 3   ┆ a        ┆ 0.322043 │
│ 1   ┆ b        ┆ 0.35477  │
│ 2   ┆ b        ┆ 0.249143 │
│ 3   ┆ b        ┆ 0.322043 │
│ 1   ┆ c        ┆ 0.35477  │
│ 2   ┆ c        ┆ 0.501713 │
│ 3   ┆ c        ┆ 0.355913 │
└─────┴──────────┴──────────┘

If you need this back in wide format, you can pivot the resulting DataFrame (pivot is only available on an eager DataFrame, hence the collect() before it).

ldf.unpivot(index="id").with_columns(
    pl.col("value").map_batches(softmax).over("id")
).collect().pivot("variable", index="id")
shape: (3, 4)
┌─────┬──────────┬──────────┬──────────┐
│ id  ┆ a        ┆ b        ┆ c        │
│ --- ┆ ---      ┆ ---      ┆ ---      │
│ i64 ┆ f64      ┆ f64      ┆ f64      │
╞═════╪══════════╪══════════╪══════════╡
│ 1   ┆ 0.290461 ┆ 0.35477  ┆ 0.35477  │
│ 2   ┆ 0.249143 ┆ 0.249143 ┆ 0.501713 │
│ 3   ┆ 0.322043 ┆ 0.322043 ┆ 0.355913 │
└─────┴──────────┴──────────┴──────────┘

I actually found a relatively nice solution that just takes advantage of batches being materialized in memory.

import numpy as np
import polars as pl

def softmax(ar):
    # ar is a 2D (rows x cols) array; keepdims=True so each row is divided by its own sum
    a = np.exp(ar)
    return a / np.sum(a, axis=-1, keepdims=True)

def apply_npf_on_pl_df(df, cols, npf):
    # the batch arrives as a materialized DataFrame, so the selected columns
    # can be round-tripped through numpy and assigned back in place
    df[cols] = npf(df[cols].to_numpy())
    return df

ldf = pl.DataFrame({'id': [1, 2, 3], 'a': [0.2, 0.1, 0.3], 'b': [0.4, 0.1, 0.3], 'c': [0.4, 0.8, 0.4]}).lazy()

cols = ['a', 'b', 'c']

ldf.map_batches(lambda bdf: apply_npf_on_pl_df(bdf, cols, softmax)).collect()

This is likely not ideal if there are a lot of other columns, but for my use case (with very few additional columns) it looks pretty efficient.
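For reference, on the sample frame this should reproduce the same values as the pivoted result in the first answer:

shape: (3, 4)
┌─────┬──────────┬──────────┬──────────┐
│ id  ┆ a        ┆ b        ┆ c        │
│ --- ┆ ---      ┆ ---      ┆ ---      │
│ i64 ┆ f64      ┆ f64      ┆ f64      │
╞═════╪══════════╪══════════╪══════════╡
│ 1   ┆ 0.290461 ┆ 0.35477  ┆ 0.35477  │
│ 2   ┆ 0.249143 ┆ 0.249143 ┆ 0.501713 │
│ 3   ┆ 0.322043 ┆ 0.322043 ┆ 0.355913 │
└─────┴──────────┴──────────┴──────────┘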
