
python - Calculate cumulative sum of time series X for time points in series Y - Stack Overflow


Imagine transactions, each with an amount, arriving throughout the day. You want to calculate the running total of amount at given points in time (9 am, 10 am, etc.).

With pandas, I would use apply to perform such an operation. With Polars, I tried using map_elements. I have also considered group_by_dynamic, but I am not sure it gives me control over the time grid's start/end/increment.

Is there a better way?

import polars as pl
import datetime

# Transactions arriving throughout the day.
df = pl.DataFrame({
    "time": [
        datetime.datetime(2025, 2, 2, 11, 1),
        datetime.datetime(2025, 2, 2, 11, 2),
        datetime.datetime(2025, 2, 2, 11, 3),
    ],
    "amount": [5.0, -1, 10],
})

# The time grid at which the running total should be evaluated.
dg = pl.DataFrame(
    pl.datetime_range(
        datetime.datetime(2025, 2, 2, 11, 0),
        datetime.datetime(2025, 2, 2, 11, 5),
        "1m",
        eager=True,
    ),
    schema=["time"],
)

# Sum all transactions with a timestamp at or before `dt`.
def _cumsum(dt):
    return df.filter(pl.col("time") <= dt).select(pl.col("amount")).sum().item()

dg.with_columns(
    cum_amount=pl.col("time").map_elements(_cumsum, return_dtype=pl.Float64)
)
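For reference, a minimal sketch of the pandas apply approach mentioned above. Here pdf and pdg are hypothetical pandas conversions of df and dg, not part of the original question:

import pandas as pd

# Hypothetical pandas equivalents of df and dg (not from the post).
pdf = df.to_pandas()
pdg = dg.to_pandas()

# For each grid timestamp, sum all transactions at or before it.
pdg["cum_amount"] = pdg["time"].apply(
    lambda dt: pdf.loc[pdf["time"] <= dt, "amount"].sum()
)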

asked Feb 2 at 11:36 by Dimitri Shvorob; edited Feb 3 at 17:07 by jqurious
  • It seems like your example excludes cases that you might care about, such as multiple entries in df for just one in dg. Additionally, and most importantly, you should have an expected output to go along with your example. – Dean MacGregor, Feb 3 at 18:00

1 Answer


This can be achieved relying purely on Polars' native expression API.

As a first step, we can associate each row in df with the earliest timestamp in dg that is equal to or later than the corresponding timestamp in df. For this, pl.DataFrame.join_asof with strategy="forward" can be used.

df.join_asof(dg, on="time", strategy="forward", coalesce=False)
shape: (3, 3)
┌─────────────────────┬────────┬─────────────────────┐
│ time                ┆ amount ┆ time_right          │
│ ---                 ┆ ---    ┆ ---                 │
│ datetime[μs]        ┆ f64    ┆ datetime[μs]        │
╞═════════════════════╪════════╪═════════════════════╡
│ 2025-02-02 11:01:00 ┆ 5.0    ┆ 2025-02-02 11:01:00 │
│ 2025-02-02 11:02:00 ┆ -1.0   ┆ 2025-02-02 11:02:00 │
│ 2025-02-02 11:03:00 ┆ 10.0   ┆ 2025-02-02 11:03:00 │
└─────────────────────┴────────┴─────────────────────┘
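Passing coalesce=False keeps the matched grid timestamp as a separate time_right column instead of folding it into time; that column is needed for the next step.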

Next, we can use these timestamps to join the amount values to dg.

dg.join(
    (
        df
        .join_asof(dg, on="time", strategy="forward", coalesce=False)
        .select("amount", pl.col("time_right").alias("time"))
    ),
    on="time",
    how="left",
)
shape: (6, 2)
┌─────────────────────┬────────┐
│ time                ┆ amount │
│ ---                 ┆ ---    │
│ datetime[μs]        ┆ f64    │
╞═════════════════════╪════════╡
│ 2025-02-02 11:00:00 ┆ null   │
│ 2025-02-02 11:01:00 ┆ 5.0    │
│ 2025-02-02 11:02:00 ┆ -1.0   │
│ 2025-02-02 11:03:00 ┆ 10.0   │
│ 2025-02-02 11:04:00 ┆ null   │
│ 2025-02-02 11:05:00 ┆ null   │
└─────────────────────┴────────┘

Note that we rename the column in the dataframe returned by pl.DataFrame.join_asof before merging back to dg.

While not shown in this example, there may now be duplicate rows for a given timestamp, as multiple rows in df can map to the same timestamp in dg. Hence, we first aggregate the amount values for each timestamp (with maintain_order=True, so the rows stay in the grid's time order for the step that follows). Then, we can perform a regular cumulative sum.

(
    dg
    .join(
        (
            df
            .join_asof(dg, on="time", strategy="forward", coalesce=False)
            .select("amount", pl.col("time_right").alias("time"))
        ),
        on="time",
        how="left",
    )
    .group_by("time").agg(pl.col("amount").sum())
    .with_columns(
        pl.col("amount").cum_sum().name.prefix("cum_")
    )
)
shape: (6, 3)
┌─────────────────────┬────────┬────────────┐
│ time                ┆ amount ┆ cum_amount │
│ ---                 ┆ ---    ┆ ---        │
│ datetime[μs]        ┆ f64    ┆ f64        │
╞═════════════════════╪════════╪════════════╡
│ 2025-02-02 11:00:00 ┆ 0.0    ┆ 0.0        │
│ 2025-02-02 11:01:00 ┆ 5.0    ┆ 5.0        │
│ 2025-02-02 11:02:00 ┆ -1.0   ┆ 4.0        │
│ 2025-02-02 11:03:00 ┆ 10.0   ┆ 14.0       │
│ 2025-02-02 11:04:00 ┆ 0.0    ┆ 14.0       │
│ 2025-02-02 11:05:00 ┆ 0.0    ┆ 14.0       │
└─────────────────────┴────────┴────────────┘
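As a quick sanity check, the join-based pipeline should reproduce the map_elements output from the question. A sketch, assuming the final frame above is bound to a hypothetical variable result:

# `result` is assumed to hold the final frame from the pipeline above
# (the original snippet does not assign it to a variable).
expected = dg.with_columns(
    cum_amount=pl.col("time").map_elements(_cumsum, return_dtype=pl.Float64)
)
assert result["cum_amount"].to_list() == expected["cum_amount"].to_list()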