最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

clickhouse - When is lagInFrame behavior different from standard SQL LAG? - Stack Overflow

programmeradmin1浏览0评论

The clickhouse documentation states:

lagInFrame behavior differs from the standard SQL lag window function. Clickhouse window function lagInFrame respects the window frame.

What is this window frame, and how does it affect the output?

Example: I want to find rows in a time series where the difference in the time column value for two consecutive rows is greater than a given threshold.

I want to compare each row with the previous row.

The following leads me to believe that the query below is the correct approach.

To get behavior identical to the lag, use ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

SELECT lag, startdate, diff
FROM  (SELECT startdate,
              lagInFrame(startdate)
                OVER(
                  ORDER BY startdate ASC ROWS BETWEEN UBOUNDED PRECEDING AND
                  UNBOUNDED FOLLOWING) AS lag,
              Date_diff('minute', lag, startdate) AS diff
       FROM   <table>
       ORDER  BY startdate ASC)
WHERE  diff > 15 

However, the following query gives me the exact same result and uses less memory.

SELECT lag, startdate, diff
FROM  (SELECT startdate,
              lagInFrame(startdate)
                OVER(
                  ORDER BY startdate ASC ROWS BETWEEN 1 PRECEDING AND
                  CURRENT ROW) AS lag,
              Date_diff('minute', lag, startdate) AS diff
       FROM   <table>
       ORDER  BY startdate ASC)
WHERE  diff > 15 

The difference is:
OVER(ORDER BY startdate ASC ROWS BETWEEN UBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
vs
OVER(ORDER BY startdate ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)

Are there cases where the output of the queries will differ?

The clickhouse documentation states:

lagInFrame behavior differs from the standard SQL lag window function. Clickhouse window function lagInFrame respects the window frame.

What is this window frame, and how does it affect the output?

Example: I want to find rows in a time series where the difference in the time column value for two consecutive rows is greater than a given threshold.

I want to compare each row with the previous row.

The following leads me to believe that the query below is the correct approach.

To get behavior identical to the lag, use ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

SELECT lag, startdate, diff
FROM  (SELECT startdate,
              lagInFrame(startdate)
                OVER(
                  ORDER BY startdate ASC ROWS BETWEEN UBOUNDED PRECEDING AND
                  UNBOUNDED FOLLOWING) AS lag,
              Date_diff('minute', lag, startdate) AS diff
       FROM   <table>
       ORDER  BY startdate ASC)
WHERE  diff > 15 

However, the following query gives me the exact same result and uses less memory.

SELECT lag, startdate, diff
FROM  (SELECT startdate,
              lagInFrame(startdate)
                OVER(
                  ORDER BY startdate ASC ROWS BETWEEN 1 PRECEDING AND
                  CURRENT ROW) AS lag,
              Date_diff('minute', lag, startdate) AS diff
       FROM   <table>
       ORDER  BY startdate ASC)
WHERE  diff > 15 

The difference is:
OVER(ORDER BY startdate ASC ROWS BETWEEN UBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
vs
OVER(ORDER BY startdate ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)

Are there cases where the output of the queries will differ?

Share Improve this question asked Feb 7 at 18:22 SoenderbySoenderby 4064 silver badges17 bronze badges
Add a comment  | 

1 Answer 1

Reset to default 1

I wrote this documentation in such way to reduce the number of false bug reports about the leadInFrame:

  • https://github.com/ClickHouse/ClickHouse/issues/67744
  • https://github.com/ClickHouse/ClickHouse/issues/72354
  • https://github.com/ClickHouse/ClickHouse/issues/37946

People just don't understand how window function's frame works in case of OVER (ORDER BY somecol) https://clickhouse.com/docs/en/sql-reference/window-functions#syntax

That the frame is bounded by the beginning of a partition and the current row. And it's an equalent of ORDER BY order ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

And the next row is not visible for leadInFrame.

It's more convenient for the usual people to use BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, because it's more compatible with lag/lead and works in all cases with any offset.


Are there cases where the output of the queries will differ?

For example if you use another offset: lagInFrame(startdate,2), then you need to use BETWEEN 2 PRECEDING AND CURRENT ROW https://fiddle.clickhouse.com/1ce6547a-109a-4552-9f1b-b9c03f972988


BTW, you can use any() over ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING and will get the same result: https://fiddle.clickhouse.com/cfa1df1e-2571-4167-9fc1-0f3b1c8f86ac

发布评论

评论列表(0)

  1. 暂无评论