database - How to retrieve the latest version of records for each time period in SQL when there are multiple versions of the sam

Community!

I have a table with periods (from_date and to_date columns) and associated values, and I need to retrieve the latest records. The complication is that there are multiple versions of the same period due to updates over time, and each record includes a creation_date that determines the versioning.

Here are the details of my table schema:

CREATE TABLE tab (
    id INTEGER, 
    from_date DATE,
    to_date DATE,
    creation_date TIMESTAMP,
    value INTEGER,
    from_timestamp TIMESTAMP,
    to_timestamp TIMESTAMP
);

In this table, the same period can have different values at different times. For example:

id  from_date   to_date     creation_date        value  from_timestamp       to_timestamp
1   2014-10-01  9999-12-31  2014-10-01 10:00:05  100    2014-10-01 00:00:00  9999-12-31 00:00:00
2   2014-10-01  2016-08-10  2015-10-01 10:00:05  100    2014-10-01 00:00:00  2016-08-11 00:00:00
3   2016-08-11  9999-12-31  2015-10-01 10:00:05  120    2016-08-11 00:00:00  9999-12-31 00:00:00
4   2014-10-01  9999-12-31  2016-10-01 10:00:05  100    2014-10-01 00:00:00  9999-12-31 00:00:00
5   2014-10-01  9999-12-31  2017-10-01 10:00:05  200    2014-10-01 00:00:00  9999-12-31 00:00:00
6   2014-10-01  2016-08-10  2018-10-01 10:00:05  200    2014-10-01 00:00:00  2016-08-11 00:00:00
7   2016-08-11  9999-12-31  2018-10-01 10:00:05  300    2016-08-11 00:00:00  9999-12-31 00:00:00
8   2014-10-01  2016-09-10  2019-10-01 10:00:05  200    2014-10-01 00:00:00  2016-09-11 00:00:00
9   2016-09-11  2021-01-20  2019-10-01 10:00:05  300    2016-09-11 00:00:00  2021-01-21 00:00:00
10  2021-01-21  9999-12-31  2019-10-01 10:00:05  350    2021-01-21 00:00:00  9999-12-31 00:00:00

The value 100 was initially assigned to the period 2014-10-01 to 9999-12-31 (id 1).

Then, the value was updated to 100 for the period 2014-10-01 to 2016-08-10 and 120 for the period 2016-08-11 to 9999-12-31 (ids 2 and 3).

The value 100 was chosen again for the period 2014-10-01 to 9999-12-31 (id 4).

Then, value 200 was chosen for the period 2014-10-01 to 9999-12-31 (id 5).

Changes in values also occurred for the periods 2014-10-01 to 2016-08-10 and 2016-08-11 to 9999-12-31 (id 6-7).

Finally, the value 200 was assigned to the period 2014-10-01 to 2016-09-10 (id 8).
The value 300 was assigned to the period 2016-09-11 to 2021-01-20 (id 9).
The value 350 was assigned to the period 2021-01-21 to 9999-12-31 (id 10).

Similarly, other periods have had their values updated over time.

What I want to achieve: I need to write a query that will return the most recent version of the records.

What I have tried: I tried using the ROW_NUMBER() function with PARTITION BY to partition the records by from_date and to_date, but without success.

asked Mar 31 at 6:41 by foxbuur, edited Mar 31 at 10:37 by Guillaume Outters
  • select * from .... group by .... where .... having period_col = max(period_col) ? – mr mcwolf Commented Mar 31 at 6:50
  • Note the two answers produce very different results, see this sample fiddle. One might be correct, maybe none. I wonder why people think they should answer such unclear questions, anyway, it doesn't really help other readers. Therefore, questions should always include the expected result as table. No one knows how you define "latest records". – Jonas Metzler Commented Mar 31 at 7:28

3 Answers

You’re on the right track with ROW_NUMBER(). The key is to partition by the period (from_date, to_date) and order by creation_date descending to get the latest version of each period. Try this:

WITH RankedRecords AS (
    SELECT 
        id, 
        from_date, 
        to_date, 
        creation_date, 
        value, 
        from_timestamp, 
        to_timestamp,
        ROW_NUMBER() OVER (
            PARTITION BY from_date, to_date 
            ORDER BY creation_date DESC
        ) AS rn
    FROM tab
)
SELECT id, from_date, to_date, creation_date, value, from_timestamp, to_timestamp
FROM RankedRecords
WHERE rn = 1;
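
To see what this query selects on the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is an assumption (the question does not name a database, and window functions need SQLite 3.25 or later), and only the columns the ranking depends on are loaded.

```python
import sqlite3

# Sample rows from the question: (id, from_date, to_date, creation_date, value).
ROWS = [
    (1, "2014-10-01", "9999-12-31", "2014-10-01 10:00:05", 100),
    (2, "2014-10-01", "2016-08-10", "2015-10-01 10:00:05", 100),
    (3, "2016-08-11", "9999-12-31", "2015-10-01 10:00:05", 120),
    (4, "2014-10-01", "9999-12-31", "2016-10-01 10:00:05", 100),
    (5, "2014-10-01", "9999-12-31", "2017-10-01 10:00:05", 200),
    (6, "2014-10-01", "2016-08-10", "2018-10-01 10:00:05", 200),
    (7, "2016-08-11", "9999-12-31", "2018-10-01 10:00:05", 300),
    (8, "2014-10-01", "2016-09-10", "2019-10-01 10:00:05", 200),
    (9, "2016-09-11", "2021-01-20", "2019-10-01 10:00:05", 300),
    (10, "2021-01-21", "9999-12-31", "2019-10-01 10:00:05", 350),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tab (id INTEGER, from_date DATE, to_date DATE, "
    "creation_date TIMESTAMP, value INTEGER)"
)
conn.executemany("INSERT INTO tab VALUES (?, ?, ?, ?, ?)", ROWS)

# Same ranking as the query above: newest row per (from_date, to_date) pair.
query = """
WITH RankedRecords AS (
    SELECT id,
           ROW_NUMBER() OVER (
               PARTITION BY from_date, to_date
               ORDER BY creation_date DESC
           ) AS rn
    FROM tab
)
SELECT id FROM RankedRecords WHERE rn = 1 ORDER BY id
"""
ids = [row[0] for row in conn.execute(query)]
print(ids)
```

On the sample data this yields ids 5, 6, 7, 8, 9 and 10: one row per distinct (from_date, to_date) pair, which includes periods from earlier versions such as id 5. Whether that matches the intended "latest records" depends on how the question defines them.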

The query below finds all records created at the latest timestamp (creation_date), which represent the most recent version of your data. It works by:

  1. Identifying the latest creation_date (e.g., 2019-10-01 10:00:05 in your example).

  2. Returning all records created at that timestamp (IDs 8, 9, 10 in your example), which form the current state of your periods.

    Good luck


WITH latest_creation AS (
    SELECT MAX(creation_date) AS max_creation_date
    FROM tab
)
SELECT *
FROM tab
WHERE creation_date = (SELECT max_creation_date FROM latest_creation)
ORDER BY from_timestamp;
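
As a quick check of the two steps described above, the following sketch runs the same query against the sample rows with Python's built-in sqlite3 module. SQLite is an assumption here, since the question does not name a database, and the timestamp columns are left out, so the result is ordered by from_date instead of from_timestamp.

```python
import sqlite3

# Sample rows from the question: (id, from_date, to_date, creation_date, value).
ROWS = [
    (1, "2014-10-01", "9999-12-31", "2014-10-01 10:00:05", 100),
    (2, "2014-10-01", "2016-08-10", "2015-10-01 10:00:05", 100),
    (3, "2016-08-11", "9999-12-31", "2015-10-01 10:00:05", 120),
    (4, "2014-10-01", "9999-12-31", "2016-10-01 10:00:05", 100),
    (5, "2014-10-01", "9999-12-31", "2017-10-01 10:00:05", 200),
    (6, "2014-10-01", "2016-08-10", "2018-10-01 10:00:05", 200),
    (7, "2016-08-11", "9999-12-31", "2018-10-01 10:00:05", 300),
    (8, "2014-10-01", "2016-09-10", "2019-10-01 10:00:05", 200),
    (9, "2016-09-11", "2021-01-20", "2019-10-01 10:00:05", 300),
    (10, "2021-01-21", "9999-12-31", "2019-10-01 10:00:05", 350),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tab (id INTEGER, from_date DATE, to_date DATE, "
    "creation_date TIMESTAMP, value INTEGER)"
)
conn.executemany("INSERT INTO tab VALUES (?, ?, ?, ?, ?)", ROWS)

# Step 1: find the latest creation_date. Step 2: keep the rows created then.
query = """
WITH latest_creation AS (
    SELECT MAX(creation_date) AS max_creation_date
    FROM tab
)
SELECT id
FROM tab
WHERE creation_date = (SELECT max_creation_date FROM latest_creation)
ORDER BY from_date
"""
latest_ids = [row[0] for row in conn.execute(query)]
print(latest_ids)
```

The ISO-formatted timestamp strings compare correctly as text, which is why MAX() works here without any date conversion.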

As I understand it, in your example you want to return 3 rows (8, 9, 10) because they form an inseparable set that was entered together.

I'm surprised not to see any name for the set (to distinguish those 8, 9, 10 from another set whose history would be stored in the same table),
so I'll add a set_name to this effect in my solutions;
if not needed (if only one set of values has its history in this table), you can remove it (it's always easier to remove something than to add).

With rank()

Your row_number() is a good start, however only 1 of your 3 desired rows will get the 1st place; the 2 others will (arbitrarily) get row_number()s 2 and 3.

What you want is rank(), which will return 1 for the three rows 8, 9 and 10 (the second set, IDs 6 and 7, will then get rank() 4, because the three tied first places are taken by IDs 8, 9 and 10;
if you need the second set to be numbered 2, for example to display the "previous set", replace rank() with dense_rank()).

with placed as (select tab.*, rank() over (partition by set_name order by creation_date desc) freshness from tab)
select * from placed where freshness = 1;
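
The difference between rank() and dense_rank() on the question's creation dates can be sketched with Python's built-in sqlite3 module. SQLite is an assumption, and so is the set_name value 'A': that column is the hypothetical addition introduced above, not part of the original schema.

```python
import sqlite3

# (id, creation_date) from the question; every row goes into one
# hypothetical set named 'A'.
ROWS = [
    (1, "2014-10-01 10:00:05"),
    (2, "2015-10-01 10:00:05"),
    (3, "2015-10-01 10:00:05"),
    (4, "2016-10-01 10:00:05"),
    (5, "2017-10-01 10:00:05"),
    (6, "2018-10-01 10:00:05"),
    (7, "2018-10-01 10:00:05"),
    (8, "2019-10-01 10:00:05"),
    (9, "2019-10-01 10:00:05"),
    (10, "2019-10-01 10:00:05"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tab (id INTEGER, set_name TEXT, creation_date TIMESTAMP)"
)
conn.executemany("INSERT INTO tab VALUES (?, 'A', ?)", ROWS)

# Compute both rankings side by side for every row.
query = """
SELECT id,
       rank()       OVER (PARTITION BY set_name ORDER BY creation_date DESC),
       dense_rank() OVER (PARTITION BY set_name ORDER BY creation_date DESC)
FROM tab
ORDER BY id
"""
# Maps id -> (rank, dense_rank).
ranks = {row[0]: (row[1], row[2]) for row in conn.execute(query)}
for row_id, pair in ranks.items():
    print(row_id, pair)
```

Rows 8, 9 and 10 get 1 from both functions; rows 6 and 7 get 4 from rank() but 2 from dense_rank(), which is the gap described above.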

Although your use case can afford the simplicity of this query (where placed is a full clone of tab with one added column),
if you ever generalize this technique to a huge data set,
you may want to use your unambiguous IDs so that placed contains only the minimal info needed to then retrieve the full rows:

with placed as (select id, rank() over (partition by set_name order by creation_date desc) freshness from tab)
select tab.* from placed join tab on tab.id = placed.id where freshness = 1;

(well, by specifying only the columns you need, you're not really deciding which data gets transferred; you're just hinting, hopefully guiding the database optimizer in a somewhat more predictable direction;
but in the end, it's the optimizer that decides, and perhaps it would have done the right thing with the unguided query anyway, without our hint)

With creation_date = max(creation_date) over the set

Another option is to first determine the most recent creation_date in the set's history,
then filter on the entries having this creation_date (and set_name, of course).

select * from tab where (set_name, creation_date) in (select set_name, max(creation_date) latest from tab group by set_name);
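
To illustrate why the filter is keyed on set_name as well as creation_date, here is a small sketch using Python's built-in sqlite3 module. SQLite is an assumption, and the two sets 'A' and 'B' with their rows are made-up data, not taken from the question.

```python
import sqlite3

# Hypothetical data: (id, set_name, creation_date). Set 'B' was last
# updated earlier than set 'A'.
ROWS = [
    (1, "A", "2018-10-01 10:00:05"),
    (2, "A", "2019-10-01 10:00:05"),
    (3, "A", "2019-10-01 10:00:05"),
    (4, "B", "2016-05-01 09:00:00"),
    (5, "B", "2017-05-01 09:00:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tab (id INTEGER, set_name TEXT, creation_date TIMESTAMP)"
)
conn.executemany("INSERT INTO tab VALUES (?, ?, ?)", ROWS)

# Latest version of each set: compare (set_name, creation_date) pairs.
per_set = """
SELECT id FROM tab
WHERE (set_name, creation_date) IN (
    SELECT set_name, max(creation_date) FROM tab GROUP BY set_name
)
ORDER BY id
"""
per_set_ids = [row[0] for row in conn.execute(per_set)]

# For contrast: a single global maximum ignores set_name entirely.
global_max = """
SELECT id FROM tab
WHERE creation_date = (SELECT max(creation_date) FROM tab)
ORDER BY id
"""
global_ids = [row[0] for row in conn.execute(global_max)]

print(per_set_ids, global_ids)
```

With this data the per-set filter keeps the newest rows of both sets, while the global maximum only keeps rows from set 'A', because set 'B' has no row at the overall latest timestamp.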

Although a window function looks tempting here, with something like where creation_date = max(creation_date) over (…),
a window function cannot appear in the where clause, so there is no advantage: you still need a two-pass query, as with every other solution:

-- Emulate the "idealized" query:
-- select * from tab having creation_date = max(creation_date) over (partition by set_name order by creation_date desc);
select * from
(
    select tab.*, case when creation_date = max(creation_date) over (partition by set_name order by creation_date desc) then 1 end latest from tab
)
matching where latest = 1;

Choosing

I personally would choose rank() with ids, because once I have chosen I just want to fetch with simple boolean tests (whereas with the max() we're re-comparing dates).

The 4 variants are presented in a fiddle.
