最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

postgresql - What is the postgressql code to fill in missing dates - Stack Overflow

programmeradmin0浏览0评论
date region beverages_units
2020-06-01 SIN51 56
- DUB53 28
- IAD77 83
- PDX79 56
2020-06-02 SIN51 34
- DUB53 0
- IAD77 46
- PDX79 169
2020-06-03 SIN51 41
- DUB53 236
- IAD77 150
- PDX79 246
date region beverages_units
2020-06-01 SIN51 56
- DUB53 28
- IAD77 83
- PDX79 56
2020-06-02 SIN51 34
- DUB53 0
- IAD77 46
- PDX79 169
2020-06-03 SIN51 41
- DUB53 236
- IAD77 150
- PDX79 246
WITH cte1 AS (
    SELECT NULLIF(date, '-') AS dated,
    region,
    beverages_units
    FROM data4
),
cte2 AS (
    SELECT (COALESCE (dated, LAG(dated) OVER ())) AS filled,
    region,
    beverages_units
FROM cte1),

cte3 AS (
    SELECT (COALESCE(filled, LAG(filled) OVER ())) AS filled2,
    region,
    beverages_units
FROM cte2),

cte4 AS (
    SELECT (COALESCE(filled2, LAG(filled2) OVER ())) AS filled3,
    region,
    beverages_units
FROM cte3)

SELECT *
FROM cte4;

This solved my problem but is there an easier way of doing this. This was the desired output

date region beverages_units
2020-06-01 SIN51 56
2020-06-01 DUB53 28
2020-06-01 IAD77 83
2020-06-01 PDX79 56
2020-06-02 SIN51 34
2020-06-02 DUB53 0
2020-06-02 IAD77 46
2020-06-02 PDX79 169
2020-06-03 SIN51 41
2020-06-03 DUB53 236
2020-06-03 IAD77 150
2020-06-03 PDX79 246
Share Improve this question edited Feb 6 at 5:54 James Z 12.3k10 gold badges27 silver badges47 bronze badges asked Feb 5 at 17:43 Uchiha_ItachiUchiha_Itachi 111 silver badge2 bronze badges 2
  • 4 - is not a valid date. There is no reliable way to translate this to 2020-06-02. Because why should it be 2020-06-02 instead of 2020-06-03? Advice: fix your data. – Frank Heikens Commented Feb 5 at 17:46
  • Dates should be stored isng a DATE type. Avoid using reserved keywords as identifiers (e.g., don't name a column DATE). If order matters, then each row must have a column, or set of columns, that can be used to define a unique ordering. – JohnH Commented Feb 6 at 0:23
Add a comment  | 

3 Answers 3

Reset to default 1

I would first fix your data table as @FrankHeikens commented. Please note the addition of id since otherwise the order of records is undetermined.

CREATE TABLE the_table (
  id integer,
  "date" date,
  region text,
  beverages_units INTEGER
);

INSERT INTO the_table (id, "date", region, beverages_units)
select id, nullif("date", '-')::date, region, beverages_units
from
(
 VALUES
  (1, '2020-06-01', 'SIN51', 56),
  (2, '-', 'DUB53', 28),
  (3, '-', 'IAD77', 83),
  (4, '-', 'PDX79', 56),
  (5, '2020-06-02', 'SIN51', 34),
  (6, '-', 'DUB53', 0),
  (7, '-', 'IAD77', 46),
  (8, '-', 'PDX79', 169),
  (9, '2020-06-03', 'SIN51', 41),
  (10, '-', 'DUB53', 236),
  (11, '-', 'IAD77', 150),
  (12, '-', 'PDX79', 246)
)
as t(id, "date", region, beverages_units);

Then use a cursor (maybe suboptimal) in a simple helper function.

create or replace function fix_it()
returns setof the_table language plpgsql as
$$
declare 
 r the_table%rowtype;
 v_date text;
begin
 for r in select * from the_table order by id loop
    if r.date is not null then 
        v_date := r.date;
    else
        r.date := v_date;
    end if;
    return next r;
 end loop;        
end;
$$;

Finally

select * from fix_it() order by id;

DB-Fiddle demo

Without a means to order rows deterministically, it's not possible to achieve the OP's desired results. I begin by setting up data corresponding to the OP's example (note that I've added an identity column, renamed date to reference_date to avoid using a reserved keyword as an identifier, and declared reference_date to be of type DATE instead of as a text type).

CREATE TABLE data4 (
  id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  reference_date date,
  region TEXT,
  beverages_units INTEGER
);

-- insert data as per OP's example
INSERT INTO
  data4 (id, reference_date, region, beverages_units) OVERRIDING SYSTEM VALUE
SELECT
  id,
  NULLIF(reference_date, '-')::DATE,
  region,
  beverages_units
FROM
  (
    VALUES
      (1, '2020-06-01', 'SIN51', 56),
      (2, '-', 'DUB53', 28),
      (3, '-', 'IAD77', 83),
      (4, '-', 'PDX79', 56),
      (5, '2020-06-02', 'SIN51', 34),
      (6, '-', 'DUB53', 0),
      (7, '-', 'IAD77', 46),
      (8, '-', 'PDX79', 169),
      (9, '2020-06-03', 'SIN51', 41),
      (10, '-', 'DUB53', 236),
      (11, '-', 'IAD77', 150),
      (12, '-', 'PDX79', 246)
  ) AS t (id, reference_date, region, beverages_units);

-- synchronize the ID generation sequence with the inserted data
SELECT
  SETVAL(PG_GET_SERIAL_SEQUENCE('data4', 'id'), MAX(id))
FROM
  data4;

The query that the OP claimed solved the problem appears to work; however, it only returned the desired results because the rows happened to be returned in the order in which they were inserted. An easy way to demonstrate this is to update a row after it has been inserted. The OP's proposed query (slightly modified to be consistent with the changes described above) is:

WITH
  cte1 AS (
    SELECT
      reference_date AS dated,
      region,
      beverages_units
    FROM
      data4
  ),
  cte2 AS (
    SELECT
      (COALESCE(dated, LAG(dated) OVER ())) AS filled,
      region,
      beverages_units
    FROM
      cte1
  ),
  cte3 AS (
    SELECT
      (COALESCE(filled, LAG(filled) OVER ())) AS filled2,
      region,
      beverages_units
    FROM
      cte2
  ),
  cte4 AS (
    SELECT
      (COALESCE(filled2, LAG(filled2) OVER ())) AS filled3,
      region,
      beverages_units
    FROM
      cte3
  )
SELECT
  *
FROM
  cte4;

Running this query immediately after inserting the data resulted in the following:

filled3 region beverages_units
2020-06-01 SIN51 56
2020-06-01 DUB53 28
2020-06-01 IAD77 83
2020-06-01 PDX79 56
2020-06-02 SIN51 34
2020-06-02 DUB53 0
2020-06-02 IAD77 46
2020-06-02 PDX79 169
2020-06-03 SIN51 41
2020-06-03 DUB53 236
2020-06-03 IAD77 150
2020-06-03 PDX79 246

After executing this update (which doesn't change any row's data)

UPDATE data4
SET
  region = region
WHERE
  id = 2;

and then repeating the query, the following was returned (the first group only has three rows and the updated row is no longer associated with a date):

filled3 region beverages_units
2020-06-01 SIN51 56
2020-06-01 IAD77 83
2020-06-01 PDX79 56
2020-06-02 SIN51 34
2020-06-02 DUB53 0
2020-06-02 IAD77 46
2020-06-02 PDX79 169
2020-06-03 SIN51 41
2020-06-03 DUB53 236
2020-06-03 IAD77 150
2020-06-03 PDX79 246
NULL DUB53 28

The following query depends on a column, id, that can be used to sort the data:

WITH
  -- assign groups based on id order and rows with associated dates
  grouped AS (
    SELECT
      id,
      reference_date,
      SUM(1) FILTER (WHERE reference_date IS NOT NULL) OVER (ORDER BY id) AS group_id,
      region,
      beverages_units
    FROM
      data4
  )
  -- use the same date for all rows within a group
SELECT
  id,
  MAX(reference_date) OVER (
    PARTITION BY
      group_id
  ) AS reference_date,
  region,
  beverages_units
FROM
  grouped
ORDER BY
  id;

This query will reliably return the correct results:

id reference_date region beverages_units
1 2020-06-01 SIN51 56
2 2020-06-01 DUB53 28
3 2020-06-01 IAD77 83
4 2020-06-01 PDX79 56
5 2020-06-02 SIN51 34
6 2020-06-02 DUB53 0
7 2020-06-02 IAD77 46
8 2020-06-02 PDX79 169
9 2020-06-03 SIN51 41
10 2020-06-03 DUB53 236
11 2020-06-03 IAD77 150
12 2020-06-03 PDX79 246

This can be accomplished with a fairly simple query. Basically you have 2 CTE. The first identities the rows having a valid date. The second takes consecutive rows from the first and creates a range for the rows to updated. This process does require a unique id. Since you contend sometimes we receive the data as is this is all caried out ina a staging table. Once corrected the undated data is copied to the permanent table. (See demo here)

with fix_dt(fdt, fid, rn ) as              -- cte to identify rows with valid dates
     ( select dt, id, row_number() over() 
        from stg_table
     where dt is not null
     order by id
     ) -- select * from fix_dt
   , rfix(fdt,rid) as                      -- cte to transform consective dated rows to range of ids 
     (select fdt
           , int8range(fid, lead(fid) over(order by fid), '()')
        from fix_dt
     )  -- select * from rfix;
update stg_table                            -- apply the date to rows missing it 
   set dt = fdt
  from rfix
 where id  <@ rid;

The above and the demo both assume the use of a staging table, however, I strongly recommend you create an id in your permanent table, just because you do not receive one. If fact if you do receive one do not use it for this purpose as you do not control it. Do not rely on your data you do not control to control your process.

发布评论

评论列表(0)

  1. 暂无评论