How can I perform a SUM window function with a time range but handle duplicate timestamps row-wise in SQL

I have a scenario where I need to calculate a running total using the SUM window function in SQL. The issue arises because some rows have duplicate timestamps, and the RANGE clause in the window function groups all rows with the same timestamp together, causing incorrect calculations.

Here’s an example of the SQL I’m trying to use:

SUM(volume) OVER (
    PARTITION BY ID
    ORDER BY td.timestamp
    RANGE BETWEEN INTERVAL '60' SECOND PRECEDING AND CURRENT ROW
) AS total_volume

Problem:

When there are duplicate timestamps, the RANGE frame treats all rows with the same timestamp as peers and includes every one of them in the same window, leading to unexpected results.
I need to process the rows individually (row-wise) within the same timestamp.

Constraints:

I can't add any slight noise to the timestamp column, as that would change my time window. The result has to be calculated precisely.

Is there a way to adjust the SQL to process rows correctly within the same timestamp range while adhering to the time window logic?

Input

timestamp	Volume
2024-11-16 08:00:00	10
2024-11-16 08:00:00	20
2024-11-16 08:01:00	30
2024-11-16 08:02:00	40
2024-11-16 08:02:00	50

Current Result (Using RANGE and Grouping by Timestamp)

timestamp	RollVolume
2024-11-16 08:00:00	30
2024-11-16 08:00:00	30
2024-11-16 08:01:00	30
2024-11-16 08:02:00	90
2024-11-16 08:02:00	90

Expected Output

timestamp	RollVolume
2024-11-16 08:00:00	10
2024-11-16 08:00:00	30
2024-11-16 08:01:00	30
2024-11-16 08:02:00	40
2024-11-16 08:02:00	90

Here, the RollVolume is calculated row by row within each timestamp, instead of grouping rows with identical timestamps.

asked Nov 16, 2024 at 16:25 by Saurabh Ghadge

In your query you have used PARTITION BY ID, but your input data does not have an ID column. Can you correct the input data format and also check whether the expected output is correct? – samhita Commented Nov 16, 2024 at 18:40
Please tag which DBMS you're using. (SQL Server, MySQL, PostgreSQL, Oracle, etc) – MatBailie Commented Nov 16, 2024 at 20:53
Note: if two rows have the same timestamp, they both happen within 0s of each other. SQL data sets have no implicit ordering, which means that in your data the volume=10 row doesn't occur "before" the volume=20 row (or vice versa). You'd have to assert something like the lowest volume row happens first. – MatBailie Commented Nov 16, 2024 at 20:57
Hey @samhita, the input data I provided intentionally does not include an ID column, as the operation is meant to be carried out within a specific partition (e.g., ID) and the data given is for only one specific ID. This means that the rolling calculation should respect both the time-based window (last 60 seconds) and the implicit partitioning by ID. – Saurabh Ghadge Commented Nov 17, 2024 at 2:44
Hey @MatBailie, the data represents events that occur at the exact same timestamp, and their order cannot be determined inherently or by any secondary attribute; rather, we assume that the rows themselves have some order. Introducing an assumption (e.g., the lowest volume occurs first) would be arbitrary and could lead to inaccurate calculations. My requirement is to calculate the rolling total strictly based on timestamps within a 60-second window, processing the rows one by one as they appear in the dataset without introducing assumptions about implicit ordering. – Saurabh Ghadge Commented Nov 17, 2024 at 2:48

2 Answers

The simplest option is to run a cumulative sum from start to finish, without the time RANGE, then use a second cumulative sum to deduct the unwanted rows, with the time RANGE...

This ensures you can use an id column to enforce an ordering without running into errors when trying to use RANGE BETWEEN.

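It also ensures that you only include rows that are strictly less than 60 seconds older than the current row (ts > current ts - 60s), rather than rows up to and including exactly 60 seconds old (ts >= current ts - 60s).

Performance-wise, it only scans the data once, avoiding the cost of correlated sub-queries.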

CREATE TABLE example (
  id    BIGINT GENERATED ALWAYS AS IDENTITY,
  x     INT,
  ts    TIMESTAMP,
  val   INT
);

CREATE TABLE

INSERT INTO
  example (x, ts, val)
VALUES
  (1, '2024-11-16 08:00:00',    10),
  (1, '2024-11-16 08:00:00',    20),
  (1, '2024-11-16 08:01:00',    30),
  (1, '2024-11-16 08:02:00',    40),
  (1, '2024-11-16 08:02:00',    50);

INSERT 0 5

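SELECT
  *,
  -- Running total of every row so far; id breaks ties between equal timestamps
  SUM(val)
    OVER (
      PARTITION BY x
          ORDER BY ts, id
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    )
  -
  -- Deduct rows that are 60 or more seconds older than the current row
  -- (the sum is NULL when no such rows exist, hence the COALESCE)
  COALESCE(
    SUM(val)
      OVER (
         PARTITION BY x
             ORDER BY ts
        RANGE BETWEEN UNBOUNDED PRECEDING AND INTERVAL '60' SECOND PRECEDING
      ),
    0
  )
    AS rolling_total
FROM
  example;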

id	x	ts	val	rolling_total
1	1	2024-11-16 08:00:00	10	10
2	1	2024-11-16 08:00:00	20	30
3	1	2024-11-16 08:01:00	30	30
4	1	2024-11-16 08:02:00	40	40
5	1	2024-11-16 08:02:00	50	90

SELECT 5

fiddle
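
To see how the two window sums combine, it can help to return them as separate columns. The following is a minimal sketch against the same example table (id, x, ts, val), assuming PostgreSQL; the aliases running_total and expired_total are names chosen here for illustration and are not part of the original query.

```sql
-- Sketch only: exposes the two parts of the rolling total separately.
-- Assumes the example table (id, x, ts, val) defined earlier in this answer.
SELECT
  id,
  ts,
  val,
  -- Everything up to and including this row; id breaks ties on ts.
  SUM(val) OVER (
    PARTITION BY x
    ORDER BY ts, id
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total,
  -- Everything whose ts is 60 seconds or more before this row's ts.
  COALESCE(
    SUM(val) OVER (
      PARTITION BY x
      ORDER BY ts
      RANGE BETWEEN UNBOUNDED PRECEDING AND INTERVAL '60' SECOND PRECEDING
    ),
    0
  ) AS expired_total
FROM
  example
ORDER BY
  ts, id;
```

With the sample data, running_total should come out as 10, 30, 60, 100, 150 and expired_total as 0, 0, 30, 60, 60, so subtracting one from the other gives the rolling totals 10, 30, 30, 40, 90.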

As mentioned in the comments, deterministic row ordering is required for accurate results. The query below uses PostgreSQL's ctid, which represents the physical location of each row in a table but can change when the table is updated.

https://dbfiddle.uk/xxm_Ujpm

WITH ordered AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY timestamp, Volume, ctid) AS rn
    FROM input
)
SELECT
    o1.timestamp,
    (
        SELECT SUM(o2.Volume)
        FROM ordered o2
        WHERE o2.timestamp > o1.timestamp - INTERVAL '60 seconds'
          AND o2.timestamp <= o1.timestamp
          AND ( 
              o2.timestamp < o1.timestamp 
              OR (o2.timestamp = o1.timestamp AND o2.rn <= o1.rn) 
          )
    ) AS RollVolume
FROM ordered o1
ORDER BY o1.timestamp, o1.rn;
