The clickhouse documentation states:
lagInFrame behavior differs from the standard SQL lag window function. Clickhouse window function lagInFrame respects the window frame.
What is this window frame, and how does it affect the output?
Example: I want to find rows in a time series where the difference in the time column value for two consecutive rows is greater than a given threshold.
I want to compare each row with the previous row.
The following leads me to believe that the query below is the correct approach.
To get behavior identical to the lag, use ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
SELECT lag, startdate, diff
FROM (SELECT startdate,
lagInFrame(startdate)
OVER(
ORDER BY startdate ASC ROWS BETWEEN UBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING) AS lag,
Date_diff('minute', lag, startdate) AS diff
FROM <table>
ORDER BY startdate ASC)
WHERE diff > 15
However, the following query gives me the exact same result and uses less memory.
SELECT lag, startdate, diff
FROM (SELECT startdate,
lagInFrame(startdate)
OVER(
ORDER BY startdate ASC ROWS BETWEEN 1 PRECEDING AND
CURRENT ROW) AS lag,
Date_diff('minute', lag, startdate) AS diff
FROM <table>
ORDER BY startdate ASC)
WHERE diff > 15
The difference is:
OVER(ORDER BY startdate ASC ROWS BETWEEN UBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
vs
OVER(ORDER BY startdate ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
Are there cases where the output of the queries will differ?
The clickhouse documentation states:
lagInFrame behavior differs from the standard SQL lag window function. Clickhouse window function lagInFrame respects the window frame.
What is this window frame, and how does it affect the output?
Example: I want to find rows in a time series where the difference in the time column value for two consecutive rows is greater than a given threshold.
I want to compare each row with the previous row.
The following leads me to believe that the query below is the correct approach.
To get behavior identical to the lag, use ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
SELECT lag, startdate, diff
FROM (SELECT startdate,
lagInFrame(startdate)
OVER(
ORDER BY startdate ASC ROWS BETWEEN UBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING) AS lag,
Date_diff('minute', lag, startdate) AS diff
FROM <table>
ORDER BY startdate ASC)
WHERE diff > 15
However, the following query gives me the exact same result and uses less memory.
SELECT lag, startdate, diff
FROM (SELECT startdate,
lagInFrame(startdate)
OVER(
ORDER BY startdate ASC ROWS BETWEEN 1 PRECEDING AND
CURRENT ROW) AS lag,
Date_diff('minute', lag, startdate) AS diff
FROM <table>
ORDER BY startdate ASC)
WHERE diff > 15
The difference is:
OVER(ORDER BY startdate ASC ROWS BETWEEN UBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
vs
OVER(ORDER BY startdate ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
Are there cases where the output of the queries will differ?
Share Improve this question asked Feb 7 at 18:22 SoenderbySoenderby 4064 silver badges17 bronze badges1 Answer
Reset to default 1I wrote this documentation in such way to reduce the number of false bug reports about the leadInFrame
:
- https://github.com/ClickHouse/ClickHouse/issues/67744
- https://github.com/ClickHouse/ClickHouse/issues/72354
- https://github.com/ClickHouse/ClickHouse/issues/37946
People just don't understand how window function's frame works in case of OVER (ORDER BY somecol)
https://clickhouse.com/docs/en/sql-reference/window-functions#syntax
That the frame is bounded by the beginning of a partition and the current row.
And it's an equalent of ORDER BY order ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
.
And the next row is not visible for leadInFrame
.
It's more convenient for the usual people to use BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
, because it's more compatible with lag/lead
and works in all cases with any offset.
Are there cases where the output of the queries will differ?
For example if you use another offset: lagInFrame(startdate,2)
, then you need to use BETWEEN 2 PRECEDING AND CURRENT ROW
https://fiddle.clickhouse.com/1ce6547a-109a-4552-9f1b-b9c03f972988
BTW, you can use any() over ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
and will get the same result: https://fiddle.clickhouse.com/cfa1df1e-2571-4167-9fc1-0f3b1c8f86ac