I have a scenario where I need to calculate a running total using the SUM window function in SQL. The issue arises because some rows have duplicate timestamps, and the RANGE clause in the window function groups all rows with the same timestamp together, causing incorrect calculations.
Here’s an example of the SQL I’m trying to use:
SUM(volume) OVER (
PARTITION BY ID
ORDER BY td.timestamp
RANGE BETWEEN INTERVAL '60' SECOND PRECEDING AND CURRENT ROW
) AS total_volume
Problem:
- When there are duplicate timestamps, the RANGE function groups all entries with the same timestamp into the same window, leading to unexpected results.
- I need to process the rows individually (row-wise) within the same timestamp.
Constraints:
- I can't add slight noise to the timestamp column, as that would change my time window. The calculation has to be exact.
Is there a way to adjust the SQL to process rows correctly within the same timestamp range while adhering to the time window logic?
Input
timestamp | Volume |
---|---|
2024-11-16 08:00:00 | 10 |
2024-11-16 08:00:00 | 20 |
2024-11-16 08:01:00 | 30 |
2024-11-16 08:02:00 | 40 |
2024-11-16 08:02:00 | 50 |
Current Result (Using RANGE and Grouping by Timestamp)
timestamp | RollVolume |
---|---|
2024-11-16 08:00:00 | 30 |
2024-11-16 08:00:00 | 30 |
2024-11-16 08:01:00 | 30 |
2024-11-16 08:02:00 | 90 |
2024-11-16 08:02:00 | 90 |
Expected Output
timestamp | RollVolume |
---|---|
2024-11-16 08:00:00 | 10 |
2024-11-16 08:00:00 | 30 |
2024-11-16 08:01:00 | 30 |
2024-11-16 08:02:00 | 40 |
2024-11-16 08:02:00 | 90 |
Here, the RollVolume is calculated row by row within each timestamp, instead of grouping rows with identical timestamps.
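To pin down the intended semantics, here is a minimal Python sketch that reproduces the expected output. It rests on two assumptions: the rows are processed in the order listed, and a row exactly 60 seconds old falls outside the window (which is what the 08:01:00 row in the expected output implies). The name `rolling_volume` and the list-of-tuples layout are illustrative only.

```python
from datetime import datetime, timedelta


def rolling_volume(rows, window=timedelta(seconds=60)):
    """Row-by-row rolling sum.

    rows: list of (timestamp, volume) tuples, already in processing order.
    Assumption: a row is in the window when its timestamp is strictly
    later than (current timestamp - window).
    """
    totals = []
    for i, (ts_i, _) in enumerate(rows):
        # Only rows up to and including the current one are considered,
        # so rows sharing a timestamp are added one at a time.
        total = sum(vol for ts_j, vol in rows[: i + 1] if ts_j > ts_i - window)
        totals.append(total)
    return totals


rows = [
    (datetime(2024, 11, 16, 8, 0, 0), 10),
    (datetime(2024, 11, 16, 8, 0, 0), 20),
    (datetime(2024, 11, 16, 8, 1, 0), 30),
    (datetime(2024, 11, 16, 8, 2, 0), 40),
    (datetime(2024, 11, 16, 8, 2, 0), 50),
]

print(rolling_volume(rows))  # [10, 30, 30, 40, 90]
```

This is quadratic in the number of rows, so it is meant as a reference for checking query results, not as a replacement for the SQL.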
asked Nov 16, 2024 at 16:25 by Saurabh Ghadge
- In your query you have used PARTITION BY ID, but your input data does not have an ID column. Can you correct the input data format and also check whether the expected output is correct? – samhita Commented Nov 16, 2024 at 18:40
- Please tag which DBMS you're using. (SQL Server, MySQL, PostgreSQL, Oracle, etc) – MatBailie Commented Nov 16, 2024 at 20:53
- Note: if two rows have the same timestamp, they both happen within 0s of each other. SQL data sets have no implicit ordering, which means that in your data the volume=10 row doesn't occur "before" the volume=20 row (or vice versa). You'd have to assert something like the lowest volume row happens first. – MatBailie Commented Nov 16, 2024 at 20:57
- Hey @samhita, the input data I provided intentionally does not include an ID column, as the operation is meant to be carried out within a specific partition (e.g., ID) and the data given is for only one specific ID. This means that the rolling calculation should respect both the time-based window (last 60 seconds) and the implicit partitioning by ID. – Saurabh Ghadge Commented Nov 17, 2024 at 2:44
- Hey @MatBailie, the data represents events that occur at the exact same timestamp, and their order cannot be determined inherently or by any secondary attribute; rather, we assume that the rows themselves have some order. Introducing an assumption (e.g., the lowest volume occurs first) would be arbitrary and could lead to inaccurate calculations. My requirement is to calculate the rolling total strictly based on timestamps within a 60-second window, processing the rows row by row as they appear in the dataset without introducing assumptions about implicit ordering. – Saurabh Ghadge Commented Nov 17, 2024 at 2:48
2 Answers
The simplest option is to run a cumulative sum from start to finish, without the time RANGE, then use a second cumulative sum to deduct the unwanted rows, with the time RANGE...
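This lets you use an id column to enforce an ordering without running into errors from RANGE BETWEEN, which (in PostgreSQL, for example) requires exactly one ORDER BY column when an offset is used.
It also ensures that you only include rows less than 60 seconds old (ts > current ts - 60s) rather than rows up to and including 60 seconds old (ts >= current ts - 60s).
Performance-wise, it only scans the data once, avoiding the cost of correlated sub-queries.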
CREATE TABLE example (
    id  BIGINT GENERATED ALWAYS AS IDENTITY,  -- insertion order, used as the tie-breaker
    x   INT,                                  -- partition key
    ts  TIMESTAMP,
    val INT
);

INSERT INTO example (x, ts, val)
VALUES
    (1, '2024-11-16 08:00:00', 10),
    (1, '2024-11-16 08:00:00', 20),
    (1, '2024-11-16 08:01:00', 30),
    (1, '2024-11-16 08:02:00', 40),
    (1, '2024-11-16 08:02:00', 50);

SELECT
    *,
    -- Running total over every row so far, with id breaking timestamp ties
    SUM(val) OVER (
        PARTITION BY x
        ORDER BY ts, id
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    )
    -
    -- Total of the rows that are 60 seconds old or older (NULL when there are none)
    COALESCE(
        SUM(val) OVER (
            PARTITION BY x
            ORDER BY ts
            RANGE BETWEEN UNBOUNDED PRECEDING AND INTERVAL '60' SECOND PRECEDING
        ),
        0
    ) AS rolling_total
FROM
    example;
id | x | ts | val | rolling_total |
---|---|---|---|---|
1 | 1 | 2024-11-16 08:00:00 | 10 | 10 |
2 | 1 | 2024-11-16 08:00:00 | 20 | 30 |
3 | 1 | 2024-11-16 08:01:00 | 30 | 30 |
4 | 1 | 2024-11-16 08:02:00 | 40 | 40 |
5 | 1 | 2024-11-16 08:02:00 | 50 | 90 |
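The same two-cumulative-sum idea can be cross-checked outside the database. The following Python sketch is an illustration under assumptions: timestamps are given as seconds from an arbitrary origin, all rows belong to one partition, and `rolling_total` is a hypothetical helper name rather than anything from the query above.

```python
from bisect import bisect_right
from itertools import accumulate


def rolling_total(rows, window=60):
    """rows: list of (id, ts_seconds, val) tuples for a single partition."""
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))  # like ORDER BY ts, id
    ts = [r[1] for r in ordered]
    # prefix[k] is the sum of the first k values in that order
    prefix = [0] + list(accumulate(r[2] for r in ordered))

    result = []
    for k, (row_id, t, _) in enumerate(ordered):
        # First sum: every row up to and including the current one
        running = prefix[k + 1]
        # Second sum: every row with ts <= t - window, ties included,
        # which mirrors how a RANGE frame treats equal timestamps
        expired = prefix[bisect_right(ts, t - window)]
        result.append((row_id, running - expired))
    return result


rows = [(1, 0, 10), (2, 0, 20), (3, 60, 30), (4, 120, 40), (5, 120, 50)]
print(rolling_total(rows))
# [(1, 10), (2, 30), (3, 30), (4, 40), (5, 90)]
```

The subtraction works because the expired rows always form a prefix of the ordered rows, so tie-breaking by id only matters for the first sum.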
As mentioned in the comments, deterministic row ordering is required for accurate results. The query below uses PostgreSQL's ctid, which represents the physical location of each row in a table but can change when the table is updated.
https://dbfiddle.uk/xxm_Ujpm
WITH ordered AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY timestamp, Volume, ctid) AS rn
    FROM input
)
SELECT
    o1.timestamp,
    (
        SELECT SUM(o2.Volume)
        FROM ordered o2
        WHERE o2.timestamp > o1.timestamp - INTERVAL '60 seconds'
          AND o2.timestamp <= o1.timestamp
          AND (
              o2.timestamp < o1.timestamp
              OR (o2.timestamp = o1.timestamp AND o2.rn <= o1.rn)
          )
    ) AS RollVolume
FROM ordered o1
ORDER BY o1.timestamp, o1.rn;
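The tie-break inside ROW_NUMBER (Volume, then ctid) is a choice, and a different choice gives different per-row totals. As a rough illustration, assuming timestamps expressed as seconds and a hypothetical helper named `rolling`, this Python sketch runs the same window logic over two orderings of the tied rows:

```python
def rolling(rows, window=60):
    """rows: list of (ts_seconds, volume) tuples in processing order."""
    totals = []
    for i, (ts_i, _) in enumerate(rows):
        # Sum rows seen so far whose timestamp is inside the window
        totals.append(sum(v for ts_j, v in rows[: i + 1] if ts_j > ts_i - window))
    return totals


# Tied rows ordered lowest volume first
lowest_first = [(0, 10), (0, 20), (60, 30), (120, 40), (120, 50)]
# Same data with the tied rows swapped
highest_first = [(0, 20), (0, 10), (60, 30), (120, 50), (120, 40)]

print(rolling(lowest_first))   # [10, 30, 30, 40, 90]
print(rolling(highest_first))  # [20, 30, 30, 50, 90]
```

Only the last row of each timestamp group is unaffected by the order, which is why the ordering has to be stated explicitly.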