python - Polars upsampling with grouping does not behave as expected - Stack Overflow

Here is the data:

import polars as pl
from datetime import datetime

df = pl.DataFrame(
    {
        "time": [
            datetime(2021, 2, 1),
            datetime(2021, 4, 2),
            datetime(2021, 5, 4),
            datetime(2021, 6, 6),
            datetime(2021, 6, 8),
            datetime(2021, 7, 10),
            datetime(2021, 8, 18),
            datetime(2021, 9, 20),
        ],
        "groups": ["A", "B", "A", "B", "A", "B", "A", "B"],
        "values": [0, 1, 2, 3, 4, 5, 6, 7],
    }
)

The upsampling and the testing:

(
    df
    .upsample(
        time_column="time",
        every="1d",
        group_by="groups",
        maintain_order=True,
    )
    .group_by("groups")
    .agg(pl.col("time").diff().max())
)
shape: (3, 2)
┌────────┬──────────────┐
│ groups ┆ time         │
│ ---    ┆ ---          │
│ str    ┆ duration[μs] │
╞════════╪══════════════╡
│ A      ┆ 92d          │
│ null   ┆ 2d           │
│ B      ┆ 72d          │
└────────┴──────────────┘

The diff is not 1 day as I would expect. Is this a bug, or am I doing something wrong?

asked Mar 13 at 10:58 by JohnRos, edited Mar 13 at 11:46 by jqurious

1 Answer

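This happens because upsample leaves the group column as null in the rows it adds, which is a bug. Those rows end up in a separate null group, so the gaps inside A and B are never filled.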

  • https://github.com/pola-rs/polars/issues/15530

upsample itself is implemented as a datetime_range followed by a join:

  • https://github.com/pola-rs/polars/blob/a4fbc9453cacb7e7e5cc476b30a98845aaa5f506/crates/polars-time/src/upsample.rs#L203

You could do the same thing manually as a workaround:

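(df.group_by("groups")
   # daily range from each group's first to last timestamp ("1d" is the default interval)
   .agg(pl.datetime_range(pl.col("time").first(), pl.col("time").last(), interval="1d"))
   .explode("time")
   # the left join brings back the original columns; the added rows get null values
   .join(df, on=["groups", "time"], how="left")
   .group_by("groups")
   .agg(pl.col("time").diff().max())
)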
shape: (2, 2)
┌────────┬──────────────┐
│ groups ┆ time         │
│ ---    ┆ ---          │
│ str    ┆ duration[μs] │
╞════════╪══════════════╡
│ A      ┆ 1d           │
│ B      ┆ 1d           │
└────────┴──────────────┘
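
To make the "range, then join" idea concrete, here is a rough plain-Python sketch of the same steps on the question's data. It is only an illustration of the technique, not how Polars implements upsample, and the helper name upsample_by_group is made up for this example. The point is that the group key is filled in for every generated row, which is exactly what goes missing in the buggy output.

```python
from datetime import datetime, timedelta

# Same rows as in the question: (time, group, value)
rows = [
    (datetime(2021, 2, 1), "A", 0),
    (datetime(2021, 4, 2), "B", 1),
    (datetime(2021, 5, 4), "A", 2),
    (datetime(2021, 6, 6), "B", 3),
    (datetime(2021, 6, 8), "A", 4),
    (datetime(2021, 7, 10), "B", 5),
    (datetime(2021, 8, 18), "A", 6),
    (datetime(2021, 9, 20), "B", 7),
]


def upsample_by_group(rows, every=timedelta(days=1)):
    """Hypothetical helper: per-group date range followed by a left join."""
    # Lookup table for the join: (group, time) -> value
    known = {(group, time): value for time, group, value in rows}

    # First and last timestamp of each group
    bounds = {}
    for time, group, _ in rows:
        lo, hi = bounds.get(group, (time, time))
        bounds[group] = (min(lo, time), max(hi, time))

    out = []
    for group, (start, end) in sorted(bounds.items()):
        current = start
        while current <= end:
            # Left join: generated rows get value None,
            # but the group key is always present.
            out.append((current, group, known.get((group, current))))
            current += every
    return out


upsampled = upsample_by_group(rows)

# Largest gap between consecutive timestamps within each group
max_gap = {}
prev = {}
for time, group, _ in upsampled:
    if group in prev:
        max_gap[group] = max(max_gap.get(group, timedelta(0)), time - prev[group])
    prev[group] = time

print(max_gap)
# {'A': datetime.timedelta(days=1), 'B': datetime.timedelta(days=1)}
```

Every group ends up with a maximum gap of one day, matching the result of the Polars workaround above.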