最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

sql - Postgres Fixed 7-day window shifting data - Stack Overflow

programmeradmin1浏览0评论

I am trying to write a query to generate fixed non-overlapping windows and counting the number of records within each window. The next window starts on the next date of record after the previous window's end date. There are gaps between dates. I think this a form of gaps and islands problem. I am using PostgresSQL.

Here are the constraints

  • The data is partitioned by id, ts
  • Fixed 7 day window
  • The windows are non-overlapping
  • The next window starts on the first day of record after the end date of the previous window

The issue I am having is how to determine the next window start given the time gap after the last window ends.

Here is example data:

id|ts                     |
--+-----------------------+
 1|2024-12-20 21:48:24.877|
 1|2025-01-04 03:17:32.757|
 1|2025-01-17 20:14:57.942|
 2|2025-01-02 22:57:29.979|
 2|2025-01-15 16:16:17.941|
 2|2025-01-16 16:25:20.665|
 2|2025-01-29 16:17:04.410|
 2|2025-01-30 16:26:21.598|
 3|2024-12-19 20:33:39.793|
 3|2024-12-28 06:44:24.236|
 3|2024-12-31 05:13:19.438|
 3|2025-01-03 10:14:29.228|
 3|2025-01-09 18:11:22.303|
 3|2025-01-10 18:32:00.508|
 3|2025-01-12 20:21:10.596|
 3|2025-01-16 17:40:39.347|

The expected output:

id|window_start           |window_end             |count|
--+-----------------------+-----------------------+-----|
 1|2024-12-20 21:48:24.877|2024-12-27 21:48:24.877|    1|
 1|2025-01-04 03:17:32.757|2025-01-11 03:17:32.757|    1|
 1|2025-01-17 20:14:57.942|2025-01-24 20:14:57.942|    1|
 2|2025-01-02 22:57:29.979|2025-01-09 22:57:29.979|    1|
 2|2025-01-15 22:57:29.979|2025-01-22 22:57:29.979|    2|
 2|2025-01-29 22:57:29.979|2025-02-05 22:57:29.979|    2|
 3|2024-12-19 20:33:39.793|2024-12-26 20:33:39.793|    1|
 3|2024-12-28 06:44:24.236|2025-01-04 06:44:24.236|    3|
 3|2025-01-09 18:11:22.303|2025-01-16 18:11:22.303|    4|

I am trying to write a query to generate fixed non-overlapping windows and counting the number of records within each window. The next window starts on the next date of record after the previous window's end date. There are gaps between dates. I think this a form of gaps and islands problem. I am using PostgresSQL.

Here are the constraints

  • The data is partitioned by id, ts
  • Fixed 7 day window
  • The windows are non-overlapping
  • The next window starts on the first day of record after the end date of the previous window

The issue I am having is how to determine the next window start given the time gap after the last window ends.

Here is example data:

id|ts                     |
--+-----------------------+
 1|2024-12-20 21:48:24.877|
 1|2025-01-04 03:17:32.757|
 1|2025-01-17 20:14:57.942|
 2|2025-01-02 22:57:29.979|
 2|2025-01-15 16:16:17.941|
 2|2025-01-16 16:25:20.665|
 2|2025-01-29 16:17:04.410|
 2|2025-01-30 16:26:21.598|
 3|2024-12-19 20:33:39.793|
 3|2024-12-28 06:44:24.236|
 3|2024-12-31 05:13:19.438|
 3|2025-01-03 10:14:29.228|
 3|2025-01-09 18:11:22.303|
 3|2025-01-10 18:32:00.508|
 3|2025-01-12 20:21:10.596|
 3|2025-01-16 17:40:39.347|

The expected output:

id|window_start           |window_end             |count|
--+-----------------------+-----------------------+-----|
 1|2024-12-20 21:48:24.877|2024-12-27 21:48:24.877|    1|
 1|2025-01-04 03:17:32.757|2025-01-11 03:17:32.757|    1|
 1|2025-01-17 20:14:57.942|2025-01-24 20:14:57.942|    1|
 2|2025-01-02 22:57:29.979|2025-01-09 22:57:29.979|    1|
 2|2025-01-15 22:57:29.979|2025-01-22 22:57:29.979|    2|
 2|2025-01-29 22:57:29.979|2025-02-05 22:57:29.979|    2|
 3|2024-12-19 20:33:39.793|2024-12-26 20:33:39.793|    1|
 3|2024-12-28 06:44:24.236|2025-01-04 06:44:24.236|    3|
 3|2025-01-09 18:11:22.303|2025-01-16 18:11:22.303|    4|
Share Improve this question edited Mar 5 at 21:56 Chace asked Mar 5 at 15:49 ChaceChace 581 silver badge6 bronze badges 5
  • 1 Have you tried anything? – Chris Maurer Commented Mar 5 at 16:08
  • I have been able to generate 7-day fixed windows, one after the other, but not dynamically shift the window to the next available record after the first window ended. – Chace Commented Mar 5 at 17:18
  • Case window_start is ('2025-01-01'),('2025-01-09')('2025-01-16') is output ranges ('2025-01-01'-'2025-01-07'),('2025-01-08'-'2025-01-15'),('2025-01-16'-'2025-01-23') or ('2025-01-01'-'2025-01-07'),('2025-01-09'-'2025-01-16')? – ValNik Commented Mar 5 at 17:22
  • @ValNik it is the later, ('2025-01-01'-'2025-01-07'),('2025-01-09'-'2025-01-16') – Chace Commented Mar 5 at 17:51
  • Simplest way is recursively iterate through all the rows (one by one). Recursion level in this case equal to table row count. Second is recursively iterate through groups. Recursion level in this case equal to table row_count/avg(rows in interval). What is your real table row count and averaged row count in interval (window)? – ValNik Commented Mar 5 at 19:04
Add a comment  | 

2 Answers 2

Reset to default 1
  • First you can identify "undoubtful window starts"
    as those clear of another ts in the preceding 7 days
  • Then create "continents", that is, group timestamps until a 7-day hole arises, first without trying to determine where the 7-day cuts occur.
  • Now in those continents, split in islands by iterating:
    Starting from the continent start, eat everything within 7 days (counting entries as the window count), then push the next entry as a new island start. Which will eat everything within 7 days, and so on.

Decomposing this reasoning with CTEs gives the SQL below.
Note that to avoid computing the "next" inside the recursive CTE,
I prefered precomputing the "next" for each entry "in case it later became an island start".

with recursive
  -- Identify unambiguous window starts: those with no previous ts in the 7 previous days.
  maybe as
  (
    select
      *,
      row_number() over w num, -- Will ease our reading of results.
      -- startpoint:
      -- - true: confirmed start of a new window
      -- - null: maybe, maybe not; will be later decided (depending on if the previous ts (nearer than 7 days ago), has itself been eaten by a previous window (thus let us be a new start) or not (then the previous is a start and we're eaten by it)).
      case when lag(ts) over w >= ts - interval '7d' then null else true end startpoint
    from ts
    window w as (partition by id order by ts)
  ),
  -- Continents of ts never more than 7 days far one from another.
  c as
  (
    select
      *,
      -- Identify it by the num of the unambiguous starting point.
      max(case when startpoint then num end) over w continent,
      -- Now attributes for *hypothetical* new island starts:
      -- for each ts, choose its successor in case this one becomes an island start
      -- The successor is the first row from the same continent, but further than 7 days from this one.
      min(num) over (w range between '7d' following and unbounded following) successor,
      -- Number of rows which would belong to this 7 days window (in case the current row is a window start).
      count(1) over (w range between current row and '7d' following) n_included
    from maybe
    window w as (partition by id order by ts)
  ),
  -- Now iterate starting from the continents,
  -- to see if we can determine islands within them.
  -- (each start of island "eats" the 7 following days, so the first row after 7 days can be elected as the start of a new island)
  i as
  (
    select * from c where startpoint
    union
    select next.*
    -- Do not fet to filter on island, as successor has been computed without this criteria (before we had determined islands).
    from i join c next using (id, continent)
    where next.num = i.successor
  )
select * from i order by id, num;

Demo in a Fiddle

Maybe possible to simplify:

with recursive
  rdata as
  (
    select d.*, row_number() over(partition by d.id order by d.ts) as rn,
        count(d.id) over(partition by d.id  order by d.ts range between current row and interval '7d' following) as cnt
    from ts d
  ),
  solution(id, ts, rn, window_start, window_end, cnt) as
  (
    select id, ts, rn, 
        ts, 
        ts + interval '7d' ,
        cnt
    from rdata
    where rn = 1
    
    union all
    
    select r.id, d.ts, d.rn, 
        case when d.ts <= r.window_end 
            then null -- eaten
            else d.ts
        end,
        case when d.ts <= r.window_end
            then r.window_end 
            else d.ts + interval '7d'
        end,
        d.cnt
    from solution r
    join rdata d on r.id = d.id and d.rn = r.rn +1
)
select id, window_start, window_end, cnt from solution
where window_start is not null
order by id, ts
;

https://dbfiddle.uk/4wMHk1Lx

发布评论

评论列表(0)

  1. 暂无评论