
python - Calculate cumulative sum of time series X for time points in series Y - Stack Overflow


Imagine transactions, each with an amount, arriving throughout the day. You want to calculate the running total of amount at given points in time (9 am, 10 am, etc.).

With pandas, I would use apply to perform such an operation. With Polars, I tried using map_elements. I have also considered group_by_dynamic, but I am not sure it gives me control over the time grid's start/end/increment.

Is there a better way?

import polars as pl
import datetime

# Transactions arriving throughout the day.
df = pl.DataFrame({
    "time": [
        datetime.datetime(2025, 2, 2, 11, 1),
        datetime.datetime(2025, 2, 2, 11, 2),
        datetime.datetime(2025, 2, 2, 11, 3),
    ],
    "amount": [5.0, -1, 10],
})

# The time grid at which the running total should be evaluated.
dg = pl.DataFrame(
    pl.datetime_range(
        datetime.datetime(2025, 2, 2, 11, 0),
        datetime.datetime(2025, 2, 2, 11, 5),
        "1m",
        eager=True,
    ),
    schema=["time"],
)

# Sum all transactions with a timestamp at or before `dt`.
def _cumsum(dt):
    return df.filter(pl.col("time") <= dt).select(pl.col("amount")).sum().item()

dg.with_columns(
    cum_amount=pl.col("time").map_elements(_cumsum, return_dtype=pl.Float64)
)
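For reference, a minimal sketch of the pandas apply approach mentioned above. Here pdf and pdg are hypothetical pandas conversions of df and dg, not part of the original question:

import pandas as pd

# Hypothetical pandas equivalents of df and dg (not from the post).
pdf = df.to_pandas()
pdg = dg.to_pandas()

# For each grid timestamp, sum all transactions at or before it.
pdg["cum_amount"] = pdg["time"].apply(
    lambda dt: pdf.loc[pdf["time"] <= dt, "amount"].sum()
)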

asked Feb 2 at 11:36 by Dimitri Shvorob; edited Feb 3 at 17:07 by jqurious
  • It seems like your example excludes cases that you might care about, such as multiple entries in df for just one in dg. Additionally, and most importantly, you should have an expected output to go along with your example. – Dean MacGregor, Feb 3 at 18:00

1 Answer


This can be achieved relying purely on Polars' native expression API.

As a first step, we can associate each row in df with the earliest timestamp in dg that is equal to or later than the corresponding timestamp in df. For this, pl.DataFrame.join_asof with strategy="forward" can be used.

df.join_asof(dg, on="time", strategy="forward", coalesce=False)
shape: (3, 3)
┌─────────────────────┬────────┬─────────────────────┐
│ time                ┆ amount ┆ time_right          │
│ ---                 ┆ ---    ┆ ---                 │
│ datetime[μs]        ┆ f64    ┆ datetime[μs]        │
╞═════════════════════╪════════╪═════════════════════╡
│ 2025-02-02 11:01:00 ┆ 5.0    ┆ 2025-02-02 11:01:00 │
│ 2025-02-02 11:02:00 ┆ -1.0   ┆ 2025-02-02 11:02:00 │
│ 2025-02-02 11:03:00 ┆ 10.0   ┆ 2025-02-02 11:03:00 │
└─────────────────────┴────────┴─────────────────────┘
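Passing coalesce=False keeps the matched grid timestamp as a separate time_right column instead of folding it into time; that column is needed for the next step.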

Next, we can use these timestamps to join the amount values to dg.

dg.join(
    (
        df
        .join_asof(dg, on="time", strategy="forward", coalesce=False)
        .select("amount", pl.col("time_right").alias("time"))
    ),
    on="time",
    how="left",
)
shape: (6, 2)
┌─────────────────────┬────────┐
│ time                ┆ amount │
│ ---                 ┆ ---    │
│ datetime[μs]        ┆ f64    │
╞═════════════════════╪════════╡
│ 2025-02-02 11:00:00 ┆ null   │
│ 2025-02-02 11:01:00 ┆ 5.0    │
│ 2025-02-02 11:02:00 ┆ -1.0   │
│ 2025-02-02 11:03:00 ┆ 10.0   │
│ 2025-02-02 11:04:00 ┆ null   │
│ 2025-02-02 11:05:00 ┆ null   │
└─────────────────────┴────────┘

Note that we rename the column in the dataframe returned by pl.DataFrame.join_asof before merging back to dg.

While not shown in this example, there may now be duplicate rows for a given timestamp, as multiple rows in df can map to the same timestamp in dg. Hence, we first aggregate the amount values for each timestamp (with maintain_order=True, so the rows stay in the grid's time order for the step that follows). Then, we can perform a regular cumulative sum.

(
    dg
    .join(
        (
            df
            .join_asof(dg, on="time", strategy="forward", coalesce=False)
            .select("amount", pl.col("time_right").alias("time"))
        ),
        on="time",
        how="left",
    )
    .group_by("time").agg(pl.col("amount").sum())
    .with_columns(
        pl.col("amount").cum_sum().name.prefix("cum_")
    )
)
shape: (6, 3)
┌─────────────────────┬────────┬────────────┐
│ time                ┆ amount ┆ cum_amount │
│ ---                 ┆ ---    ┆ ---        │
│ datetime[μs]        ┆ f64    ┆ f64        │
╞═════════════════════╪════════╪════════════╡
│ 2025-02-02 11:00:00 ┆ 0.0    ┆ 0.0        │
│ 2025-02-02 11:01:00 ┆ 5.0    ┆ 5.0        │
│ 2025-02-02 11:02:00 ┆ -1.0   ┆ 4.0        │
│ 2025-02-02 11:03:00 ┆ 10.0   ┆ 14.0       │
│ 2025-02-02 11:04:00 ┆ 0.0    ┆ 14.0       │
│ 2025-02-02 11:05:00 ┆ 0.0    ┆ 14.0       │
└─────────────────────┴────────┴────────────┘
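As a quick sanity check, the join-based pipeline should reproduce the map_elements output from the question. A sketch, assuming the final frame above is bound to a hypothetical variable result:

# `result` is assumed to hold the final frame from the pipeline above
# (the original snippet does not assign it to a variable).
expected = dg.with_columns(
    cum_amount=pl.col("time").map_elements(_cumsum, return_dtype=pl.Float64)
)
assert result["cum_amount"].to_list() == expected["cum_amount"].to_list()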