python - Perform a rolling operation on indices without using `with_row_index()`? - Stack Overflow

I have a DataFrame like this:

import polars as pl

df = pl.DataFrame({"x": [1.2, 1.3, 3.4, 3.5]})
df

# shape: (4, 1)
# ┌─────┐
# │ x   │
# │ --- │
# │ f64 │
# ╞═════╡
# │ 1.2 │
# │ 1.3 │
# │ 3.4 │
# │ 3.5 │
# └─────┘

I would like to make a rolling aggregation using .rolling() so that each row uses a window [-2:1], i.e. from two rows before the current row to one row after it:

shape: (4, 2)
┌─────┬───────────────────┐
│ x   ┆ y                 │
│ --- ┆ ---               │
│ f64 ┆ list[f64]         │
╞═════╪═══════════════════╡
│ 1.2 ┆ [1.2, 1.3]        │
│ 1.3 ┆ [1.2, 1.3, 3.4]   │
│ 3.4 ┆ [1.2, 1.3, … 3.5] │
│ 3.5 ┆ [1.3, 3.4, 3.5]   │
└─────┴───────────────────┘

So far, I managed to do this with the following code:

df.with_row_index("index").with_columns(
  y = pl.col("x").rolling(index_column = "index", period = "4i", offset = "-3i")
).drop("index")

However, this requires manually creating an index column and then removing it after the operation. Is there a way to achieve the same result in a single with_columns() call?

Asked Feb 3 at 11:37 by bretauv, edited Feb 3 at 14:12 by TylerH

Comment: There are plans to allow .rolling(length=5, position=-2) - github/pola-rs/polars/issues/12049 – jqurious, Feb 3 at 17:34

2 Answers

Pure expressions approach (apparently slow)

You can use concat_list with shift:

(
    df
    .with_columns(
        y=pl.concat_list(
            pl.col("x").shift(n)
            for n in range(2, -2, -1)
        )
        .list.drop_nulls()
    )
)

shape: (4, 2)
┌─────┬───────────────────┐
│ x   ┆ y                 │
│ --- ┆ ---               │
│ f64 ┆ list[f64]         │
╞═════╪═══════════════════╡
│ 1.2 ┆ [1.2, 1.3]        │
│ 1.3 ┆ [1.2, 1.3, 3.4]   │
│ 3.4 ┆ [1.2, 1.3, … 3.5] │
│ 3.5 ┆ [1.3, 3.4, 3.5]   │
└─────┴───────────────────┘

There are a few things to note here.

  1. A positive argument to shift moves values toward later rows, so each row sees an earlier value. That is the opposite sign convention from your window notation.
  2. range can count backwards with (start, stop, step), but stop is exclusive, so that argument needs an extra -1: range(2, -2, -1) yields 2, 1, 0, -1.
  3. concat_list leaves nulls in the rows near the beginning and end of the column, where the shifted values run off the edge, so you need to drop them manually with .list.drop_nulls().
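
To see how these three notes combine into the expected windows, here is a minimal plain-Python sketch of the same idea with no Polars involved. The helper shift_list is hypothetical: it only imitates the sign convention of shift described above (a positive n moves values toward later rows and fills the gap with None) and is not part of any library.

```python
def shift_list(values, n):
    """Imitate shift(n) on a plain list: positive n moves values to later rows."""
    size = len(values)
    # Row i receives the value that sat n rows earlier, or None if that row does not exist.
    return [values[i - n] if 0 <= i - n < size else None for i in range(size)]


values = [1.2, 1.3, 3.4, 3.5]

# Window [-2, 1] becomes shifts 2, 1, 0, -1 (signs flipped, stop is exclusive).
shifts = list(range(2, -2, -1))

# One shifted copy of the column per shift amount.
columns = [shift_list(values, n) for n in shifts]

# Read the shifted copies row by row and drop the None padding,
# which plays the role of concat_list(...).list.drop_nulls().
windows = [[v for v in row if v is not None] for row in zip(*columns)]

for value, window in zip(values, windows):
    print(value, window)
```

The windows printed by this sketch line up with the y column in the table above.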

As always, you can wrap this into a function, including a translation of your preferred notation to what you actually need in range for it to work.

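from typing import Sequence

import polars as pl


def my_roll(in_column: str | pl.Expr, window: Sequence[int]) -> pl.Expr:
    # Accept either a column name or an expression.
    if isinstance(in_column, str):
        in_column = pl.col(in_column)
    # Translate [first, last] into shift amounts: shift(n) looks n rows back,
    # so the signs flip and the exclusive stop needs an extra -1.
    pl_window = range(-window[0], -window[1] - 1, -1)
    return pl.concat_list(in_column.shift(n) for n in pl_window).list.drop_nulls()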

which then allows you to do

df.with_columns(y=my_roll("x", [-2,1]))

If you don't care about static typing, you can even monkey patch it onto pl.Expr with pl.Expr.my_roll = my_roll and then call df.with_columns(y=pl.col("x").my_roll([-2,1])), but pylance/pyright/mypy/etc. will complain that the attribute does not exist.

Another approach that's kind of cheating if you're an expression purist

You can wrap the built-in approach, .with_row_index plus .rolling, in a .map_batches call that turns your column into a DataFrame and returns just the Series you care about.

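def my_roll(in_column: str | pl.Expr, window: Sequence[int]) -> pl.Expr:
    # Accept either a column name or an expression.
    if isinstance(in_column, str):
        in_column = pl.col(in_column)
    # Number of rows in the window, e.g. [-2, 1] -> "4i".
    period = f"{window[1]-window[0]+1}i"
    # Start one row before the window, e.g. [-2, 1] -> "-3i".
    offset = f"{window[0]-1}i"
    return in_column.map_batches(
        lambda s: (
            s.to_frame()
            .with_row_index()  # adds a column named "index" by default
            .select(
                pl.col(s.name).rolling(
                    index_column="index",
                    period=period,
                    offset=offset,
                )
            )
            .get_column(s.name)
        )
    )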

The way this works is that map_batches passes your column to a function as a Series and expects a Series back. If that function turns the Series into a DataFrame, attaches the row index, does the rolling, and extracts the resulting Series, you get exactly what you want, all contained in an expression. It should be just as performant as the verbose way, assuming you don't use the row index for anything else.

Then you do

df.with_columns(y=my_roll("x", [-2,1]))
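
The period and offset arithmetic in my_roll can be checked by hand. The plain-Python sketch below rests on one assumption about .rolling(): that with its default closed="right" setting, row t selects the index interval (t + offset, t + offset + period], which is consistent with the outputs shown in this thread. Both helper names are hypothetical and exist only for this illustration.

```python
def window_to_period_offset(window):
    """Translate an inclusive [first, last] row window into period/offset strings."""
    first, last = window
    period = f"{last - first + 1}i"  # number of rows covered
    offset = f"{first - 1}i"         # one row before the window starts
    return period, offset


def rows_in_window(t, n_rows, period, offset):
    """Row indices picked for row t, ASSUMING the interval (t + offset, t + offset + period]."""
    p = int(period.rstrip("i"))
    o = int(offset.rstrip("i"))
    lower = t + o      # exclusive lower bound
    upper = t + o + p  # inclusive upper bound
    return [i for i in range(n_rows) if lower < i <= upper]


period, offset = window_to_period_offset([-2, 1])
print(period, offset)

# Which rows of a 4-row frame end up in each row's window.
selected = [rows_in_window(t, 4, period, offset) for t in range(4)]
for t, rows in enumerate(selected):
    print(t, rows)
```

For the window [-2, 1] this gives "4i" and "-3i", the same values used in the question, and the selected row indices match the lists in the expected output.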

It looks like pl.Expr.rolling() expects a string as the index column, so you need an actual column in the frame. You can use pl.DataFrame.select() instead of pl.DataFrame.with_columns() if that reads better:

df.with_row_index().select(
    df.columns,
    y=pl.col("x").rolling(index_column="index", period="4i", offset="-3i")
)

shape: (4, 2)
┌─────┬───────────────────┐
│ x   ┆ y                 │
│ --- ┆ ---               │
│ f64 ┆ list[f64]         │
╞═════╪═══════════════════╡
│ 1.2 ┆ [1.2, 1.3]        │
│ 1.3 ┆ [1.2, 1.3, 3.4]   │
│ 3.4 ┆ [1.2, 1.3, … 3.5] │
│ 3.5 ┆ [1.3, 3.4, 3.5]   │
└─────┴───────────────────┘

You could also use pl.DataFrame.rolling(), which accepts an expression as the index column, together with pl.int_range(), but it doesn't look much better to be honest:

df.with_columns(
    df
    .rolling(index_column=pl.int_range(pl.len()), period="4i", offset="-3i")
    .agg(pl.col.x.alias("y"))["y"]
)

shape: (4, 2)
┌─────┬───────────────────┐
│ x   ┆ y                 │
│ --- ┆ ---               │
│ f64 ┆ list[f64]         │
╞═════╪═══════════════════╡
│ 1.2 ┆ [1.2, 1.3]        │
│ 1.3 ┆ [1.2, 1.3, 3.4]   │
│ 3.4 ┆ [1.2, 1.3, … 3.5] │
│ 3.5 ┆ [1.3, 3.4, 3.5]   │
└─────┴───────────────────┘