Here is the data:
import polars as pl
from datetime import datetime

df = pl.DataFrame(
    {
        "time": [
            datetime(2021, 2, 1),
            datetime(2021, 4, 2),
            datetime(2021, 5, 4),
            datetime(2021, 6, 6),
            datetime(2021, 6, 8),
            datetime(2021, 7, 10),
            datetime(2021, 8, 18),
            datetime(2021, 9, 20),
        ],
        "groups": ["A", "B", "A", "B", "A", "B", "A", "B"],
        "values": [0, 1, 2, 3, 4, 5, 6, 7],
    }
)
The upsampling and the testing:
(
    df
    .upsample(
        time_column="time",
        every="1d",
        group_by="groups",
        maintain_order=True,
    )
    .group_by("groups")
    .agg(pl.col("time").diff().max())
)
shape: (3, 2)
┌────────┬──────────────┐
│ groups ┆ time         │
│ ---    ┆ ---          │
│ str    ┆ duration[μs] │
╞════════╪══════════════╡
│ A      ┆ 92d          │
│ null   ┆ 2d           │
│ B      ┆ 72d          │
└────────┴──────────────┘
The diff is not 1 day as I would expect. Is this a bug, or am I doing something wrong?
Asked Mar 13 at 10:58 by JohnRos; edited Mar 13 at 11:46 by jqurious.

1 Answer
This happens because the upsampled rows end up with null in the group column, which is a bug: https://github.com/pola-rs/polars/issues/15530

upsample itself is implemented as a datetime_range and a join: https://github.com/pola-rs/polars/blob/a4fbc9453cacb7e7e5cc476b30a98845aaa5f506/crates/polars-time/src/upsample.rs#L203

You can do the same thing manually as a workaround:
(
    df
    .group_by("groups")
    .agg(pl.datetime_range(pl.col("time").first(), pl.col("time").last()))
    .explode("time")
    .join(df, on=["groups", "time"], how="left")
    .group_by("groups")
    .agg(pl.col("time").diff().max())
)
shape: (2, 2)
┌────────┬──────────────┐
│ groups ┆ time         │
│ ---    ┆ ---          │
│ str    ┆ duration[μs] │
╞════════╪══════════════╡
│ A      ┆ 1d           │
│ B      ┆ 1d           │
└────────┴──────────────┘