postgresql - What is the postgressql code to fill in missing dates

date	region	beverages_units
2020-06-01	SIN51	56
-	DUB53	28
-	IAD77	83
-	PDX79	56
2020-06-02	SIN51	34
-	DUB53	0
-	IAD77	46
-	PDX79	169
2020-06-03	SIN51	41
-	DUB53	236
-	IAD77	150
-	PDX79	246

date	region	beverages_units
2020-06-01	SIN51	56
-	DUB53	28
-	IAD77	83
-	PDX79	56
2020-06-02	SIN51	34
-	DUB53	0
-	IAD77	46
-	PDX79	169
2020-06-03	SIN51	41
-	DUB53	236
-	IAD77	150
-	PDX79	246

WITH cte1 AS (
    SELECT NULLIF(date, '-') AS dated,
    region,
    beverages_units
    FROM data4
),
cte2 AS (
    SELECT (COALESCE (dated, LAG(dated) OVER ())) AS filled,
    region,
    beverages_units
FROM cte1),

cte3 AS (
    SELECT (COALESCE(filled, LAG(filled) OVER ())) AS filled2,
    region,
    beverages_units
FROM cte2),

cte4 AS (
    SELECT (COALESCE(filled2, LAG(filled2) OVER ())) AS filled3,
    region,
    beverages_units
FROM cte3)

SELECT *
FROM cte4;

This solved my problem but is there an easier way of doing this. This was the desired output

date	region	beverages_units
2020-06-01	SIN51	56
2020-06-01	DUB53	28
2020-06-01	IAD77	83
2020-06-01	PDX79	56
2020-06-02	SIN51	34
2020-06-02	DUB53	0
2020-06-02	IAD77	46
2020-06-02	PDX79	169
2020-06-03	SIN51	41
2020-06-03	DUB53	236
2020-06-03	IAD77	150
2020-06-03	PDX79	246

Share Improve this question edited Feb 6 at 5:54 James Z 12.3k10 gold badges27 silver badges47 bronze badges asked Feb 5 at 17:43 Uchiha_Itachi 111 silver badge2 bronze badges

4 - is not a valid date. There is no reliable way to translate this to 2020-06-02. Because why should it be 2020-06-02 instead of 2020-06-03? Advice: fix your data. – Frank Heikens Commented Feb 5 at 17:46
Dates should be stored isng a DATE type. Avoid using reserved keywords as identifiers (e.g., don't name a column DATE). If order matters, then each row must have a column, or set of columns, that can be used to define a unique ordering. – JohnH Commented Feb 6 at 0:23

Add a comment |

3 Answers 3

Sorted by: Reset to default 1

I would first fix your data table as @FrankHeikens commented. Please note the addition of id since otherwise the order of records is undetermined.

CREATE TABLE the_table (
  id integer,
  "date" date,
  region text,
  beverages_units INTEGER
);

INSERT INTO the_table (id, "date", region, beverages_units)
select id, nullif("date", '-')::date, region, beverages_units
from
(
 VALUES
  (1, '2020-06-01', 'SIN51', 56),
  (2, '-', 'DUB53', 28),
  (3, '-', 'IAD77', 83),
  (4, '-', 'PDX79', 56),
  (5, '2020-06-02', 'SIN51', 34),
  (6, '-', 'DUB53', 0),
  (7, '-', 'IAD77', 46),
  (8, '-', 'PDX79', 169),
  (9, '2020-06-03', 'SIN51', 41),
  (10, '-', 'DUB53', 236),
  (11, '-', 'IAD77', 150),
  (12, '-', 'PDX79', 246)
)
as t(id, "date", region, beverages_units);

Then use a cursor (maybe suboptimal) in a simple helper function.

create or replace function fix_it()
returns setof the_table language plpgsql as
$$
declare 
 r the_table%rowtype;
 v_date text;
begin
 for r in select * from the_table order by id loop
    if r.date is not null then 
        v_date := r.date;
    else
        r.date := v_date;
    end if;
    return next r;
 end loop;        
end;
$$;

Finally

select * from fix_it() order by id;

DB-Fiddle demo

Without a means to order rows deterministically, it's not possible to achieve the OP's desired results. I begin by setting up data corresponding to the OP's example (note that I've added an identity column, renamed date to reference_date to avoid using a reserved keyword as an identifier, and declared reference_date to be of type DATE instead of as a text type).

CREATE TABLE data4 (
  id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  reference_date date,
  region TEXT,
  beverages_units INTEGER
);

-- insert data as per OP's example
INSERT INTO
  data4 (id, reference_date, region, beverages_units) OVERRIDING SYSTEM VALUE
SELECT
  id,
  NULLIF(reference_date, '-')::DATE,
  region,
  beverages_units
FROM
  (
    VALUES
      (1, '2020-06-01', 'SIN51', 56),
      (2, '-', 'DUB53', 28),
      (3, '-', 'IAD77', 83),
      (4, '-', 'PDX79', 56),
      (5, '2020-06-02', 'SIN51', 34),
      (6, '-', 'DUB53', 0),
      (7, '-', 'IAD77', 46),
      (8, '-', 'PDX79', 169),
      (9, '2020-06-03', 'SIN51', 41),
      (10, '-', 'DUB53', 236),
      (11, '-', 'IAD77', 150),
      (12, '-', 'PDX79', 246)
  ) AS t (id, reference_date, region, beverages_units);

-- synchronize the ID generation sequence with the inserted data
SELECT
  SETVAL(PG_GET_SERIAL_SEQUENCE('data4', 'id'), MAX(id))
FROM
  data4;

The query that the OP claimed solved the problem appears to work; however, it only returned the desired results because the rows happened to be returned in the order in which they were inserted. An easy way to demonstrate this is to update a row after it has been inserted. The OP's proposed query (slightly modified to be consistent with the changes described above) is:

WITH
  cte1 AS (
    SELECT
      reference_date AS dated,
      region,
      beverages_units
    FROM
      data4
  ),
  cte2 AS (
    SELECT
      (COALESCE(dated, LAG(dated) OVER ())) AS filled,
      region,
      beverages_units
    FROM
      cte1
  ),
  cte3 AS (
    SELECT
      (COALESCE(filled, LAG(filled) OVER ())) AS filled2,
      region,
      beverages_units
    FROM
      cte2
  ),
  cte4 AS (
    SELECT
      (COALESCE(filled2, LAG(filled2) OVER ())) AS filled3,
      region,
      beverages_units
    FROM
      cte3
  )
SELECT
  *
FROM
  cte4;

Running this query immediately after inserting the data resulted in the following:

filled3	region	beverages_units
2020-06-01	SIN51	56
2020-06-01	DUB53	28
2020-06-01	IAD77	83
2020-06-01	PDX79	56
2020-06-02	SIN51	34
2020-06-02	DUB53	0
2020-06-02	IAD77	46
2020-06-02	PDX79	169
2020-06-03	SIN51	41
2020-06-03	DUB53	236
2020-06-03	IAD77	150
2020-06-03	PDX79	246

After executing this update (which doesn't change any row's data)

UPDATE data4
SET
  region = region
WHERE
  id = 2;

and then repeating the query, the following was returned (the first group only has three rows and the updated row is no longer associated with a date):

filled3	region	beverages_units
2020-06-01	SIN51	56
2020-06-01	IAD77	83
2020-06-01	PDX79	56
2020-06-02	SIN51	34
2020-06-02	DUB53	0
2020-06-02	IAD77	46
2020-06-02	PDX79	169
2020-06-03	SIN51	41
2020-06-03	DUB53	236
2020-06-03	IAD77	150
2020-06-03	PDX79	246
NULL	DUB53	28

The following query depends on a column, id, that can be used to sort the data:

WITH
  -- assign groups based on id order and rows with associated dates
  grouped AS (
    SELECT
      id,
      reference_date,
      SUM(1) FILTER (WHERE reference_date IS NOT NULL) OVER (ORDER BY id) AS group_id,
      region,
      beverages_units
    FROM
      data4
  )
  -- use the same date for all rows within a group
SELECT
  id,
  MAX(reference_date) OVER (
    PARTITION BY
      group_id
  ) AS reference_date,
  region,
  beverages_units
FROM
  grouped
ORDER BY
  id;

This query will reliably return the correct results:

id	reference_date	region	beverages_units
1	2020-06-01	SIN51	56
2	2020-06-01	DUB53	28
3	2020-06-01	IAD77	83
4	2020-06-01	PDX79	56
5	2020-06-02	SIN51	34
6	2020-06-02	DUB53	0
7	2020-06-02	IAD77	46
8	2020-06-02	PDX79	169
9	2020-06-03	SIN51	41
10	2020-06-03	DUB53	236
11	2020-06-03	IAD77	150
12	2020-06-03	PDX79	246

This can be accomplished with a fairly simple query. Basically you have 2 CTE. The first identities the rows having a valid date. The second takes consecutive rows from the first and creates a range for the rows to updated. This process does require a unique id. Since you contend sometimes we receive the data as is this is all caried out ina a staging table. Once corrected the undated data is copied to the permanent table. (See demo here)

with fix_dt(fdt, fid, rn ) as              -- cte to identify rows with valid dates
     ( select dt, id, row_number() over() 
        from stg_table
     where dt is not null
     order by id
     ) -- select * from fix_dt
   , rfix(fdt,rid) as                      -- cte to transform consective dated rows to range of ids 
     (select fdt
           , int8range(fid, lead(fid) over(order by fid), '()')
        from fix_dt
     )  -- select * from rfix;
update stg_table                            -- apply the date to rows missing it 
   set dt = fdt
  from rfix
 where id  <@ rid;

The above and the demo both assume the use of a staging table, however, I strongly recommend you create an id in your permanent table, just because you do not receive one. If fact if you do receive one do not use it for this purpose as you do not control it. Do not rely on your data you do not control to control your process.

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

postgresql - What is the postgressql code to fill in missing dates - Stack Overflow

3 Answers 3

与本文相关的文章

评论列表(0)