I am trying to write a query to generate fixed non-overlapping windows and counting the number of records within each window. The next window starts on the next date of record after the previous window's end date. There are gaps between dates. I think this a form of gaps and islands problem. I am using PostgresSQL.
Here are the constraints
- The data is partitioned by id, ts
- Fixed 7 day window
- The windows are non-overlapping
- The next window starts on the first day of record after the end date of the previous window
The issue I am having is how to determine the next window start given the time gap after the last window ends.
Here is example data:
id|ts |
--+-----------------------+
1|2024-12-20 21:48:24.877|
1|2025-01-04 03:17:32.757|
1|2025-01-17 20:14:57.942|
2|2025-01-02 22:57:29.979|
2|2025-01-15 16:16:17.941|
2|2025-01-16 16:25:20.665|
2|2025-01-29 16:17:04.410|
2|2025-01-30 16:26:21.598|
3|2024-12-19 20:33:39.793|
3|2024-12-28 06:44:24.236|
3|2024-12-31 05:13:19.438|
3|2025-01-03 10:14:29.228|
3|2025-01-09 18:11:22.303|
3|2025-01-10 18:32:00.508|
3|2025-01-12 20:21:10.596|
3|2025-01-16 17:40:39.347|
The expected output:
id|window_start |window_end |count|
--+-----------------------+-----------------------+-----|
1|2024-12-20 21:48:24.877|2024-12-27 21:48:24.877| 1|
1|2025-01-04 03:17:32.757|2025-01-11 03:17:32.757| 1|
1|2025-01-17 20:14:57.942|2025-01-24 20:14:57.942| 1|
2|2025-01-02 22:57:29.979|2025-01-09 22:57:29.979| 1|
2|2025-01-15 22:57:29.979|2025-01-22 22:57:29.979| 2|
2|2025-01-29 22:57:29.979|2025-02-05 22:57:29.979| 2|
3|2024-12-19 20:33:39.793|2024-12-26 20:33:39.793| 1|
3|2024-12-28 06:44:24.236|2025-01-04 06:44:24.236| 3|
3|2025-01-09 18:11:22.303|2025-01-16 18:11:22.303| 4|
I am trying to write a query to generate fixed non-overlapping windows and counting the number of records within each window. The next window starts on the next date of record after the previous window's end date. There are gaps between dates. I think this a form of gaps and islands problem. I am using PostgresSQL.
Here are the constraints
- The data is partitioned by id, ts
- Fixed 7 day window
- The windows are non-overlapping
- The next window starts on the first day of record after the end date of the previous window
The issue I am having is how to determine the next window start given the time gap after the last window ends.
Here is example data:
id|ts |
--+-----------------------+
1|2024-12-20 21:48:24.877|
1|2025-01-04 03:17:32.757|
1|2025-01-17 20:14:57.942|
2|2025-01-02 22:57:29.979|
2|2025-01-15 16:16:17.941|
2|2025-01-16 16:25:20.665|
2|2025-01-29 16:17:04.410|
2|2025-01-30 16:26:21.598|
3|2024-12-19 20:33:39.793|
3|2024-12-28 06:44:24.236|
3|2024-12-31 05:13:19.438|
3|2025-01-03 10:14:29.228|
3|2025-01-09 18:11:22.303|
3|2025-01-10 18:32:00.508|
3|2025-01-12 20:21:10.596|
3|2025-01-16 17:40:39.347|
The expected output:
id|window_start |window_end |count|
--+-----------------------+-----------------------+-----|
1|2024-12-20 21:48:24.877|2024-12-27 21:48:24.877| 1|
1|2025-01-04 03:17:32.757|2025-01-11 03:17:32.757| 1|
1|2025-01-17 20:14:57.942|2025-01-24 20:14:57.942| 1|
2|2025-01-02 22:57:29.979|2025-01-09 22:57:29.979| 1|
2|2025-01-15 22:57:29.979|2025-01-22 22:57:29.979| 2|
2|2025-01-29 22:57:29.979|2025-02-05 22:57:29.979| 2|
3|2024-12-19 20:33:39.793|2024-12-26 20:33:39.793| 1|
3|2024-12-28 06:44:24.236|2025-01-04 06:44:24.236| 3|
3|2025-01-09 18:11:22.303|2025-01-16 18:11:22.303| 4|
Share
Improve this question
edited Mar 5 at 21:56
Chace
asked Mar 5 at 15:49
ChaceChace
581 silver badge6 bronze badges
5
|
2 Answers
Reset to default 1- First you can identify "undoubtful window starts"
as those clear of another ts in the preceding 7 days - Then create "continents", that is, group timestamps until a 7-day hole arises, first without trying to determine where the 7-day cuts occur.
- Now in those continents, split in islands by iterating:
Starting from the continent start, eat everything within 7 days (counting entries as the window count), then push the next entry as a new island start. Which will eat everything within 7 days, and so on.
Decomposing this reasoning with CTEs gives the SQL below.
Note that to avoid computing the "next" inside the recursive CTE,
I prefered precomputing the "next" for each entry "in case it later became an island start".
with recursive
-- Identify unambiguous window starts: those with no previous ts in the 7 previous days.
maybe as
(
select
*,
row_number() over w num, -- Will ease our reading of results.
-- startpoint:
-- - true: confirmed start of a new window
-- - null: maybe, maybe not; will be later decided (depending on if the previous ts (nearer than 7 days ago), has itself been eaten by a previous window (thus let us be a new start) or not (then the previous is a start and we're eaten by it)).
case when lag(ts) over w >= ts - interval '7d' then null else true end startpoint
from ts
window w as (partition by id order by ts)
),
-- Continents of ts never more than 7 days far one from another.
c as
(
select
*,
-- Identify it by the num of the unambiguous starting point.
max(case when startpoint then num end) over w continent,
-- Now attributes for *hypothetical* new island starts:
-- for each ts, choose its successor in case this one becomes an island start
-- The successor is the first row from the same continent, but further than 7 days from this one.
min(num) over (w range between '7d' following and unbounded following) successor,
-- Number of rows which would belong to this 7 days window (in case the current row is a window start).
count(1) over (w range between current row and '7d' following) n_included
from maybe
window w as (partition by id order by ts)
),
-- Now iterate starting from the continents,
-- to see if we can determine islands within them.
-- (each start of island "eats" the 7 following days, so the first row after 7 days can be elected as the start of a new island)
i as
(
select * from c where startpoint
union
select next.*
-- Do not fet to filter on island, as successor has been computed without this criteria (before we had determined islands).
from i join c next using (id, continent)
where next.num = i.successor
)
select * from i order by id, num;
Demo in a Fiddle
Maybe possible to simplify:
with recursive
rdata as
(
select d.*, row_number() over(partition by d.id order by d.ts) as rn,
count(d.id) over(partition by d.id order by d.ts range between current row and interval '7d' following) as cnt
from ts d
),
solution(id, ts, rn, window_start, window_end, cnt) as
(
select id, ts, rn,
ts,
ts + interval '7d' ,
cnt
from rdata
where rn = 1
union all
select r.id, d.ts, d.rn,
case when d.ts <= r.window_end
then null -- eaten
else d.ts
end,
case when d.ts <= r.window_end
then r.window_end
else d.ts + interval '7d'
end,
d.cnt
from solution r
join rdata d on r.id = d.id and d.rn = r.rn +1
)
select id, window_start, window_end, cnt from solution
where window_start is not null
order by id, ts
;
https://dbfiddle.uk/4wMHk1Lx
or
('2025-01-01'-'2025-01-07'),('2025-01-09'-'2025-01-16')? – ValNik Commented Mar 5 at 17:22