date | region | beverages_units |
---|---|---|
2020-06-01 | SIN51 | 56 |
- | DUB53 | 28 |
- | IAD77 | 83 |
- | PDX79 | 56 |
2020-06-02 | SIN51 | 34 |
- | DUB53 | 0 |
- | IAD77 | 46 |
- | PDX79 | 169 |
2020-06-03 | SIN51 | 41 |
- | DUB53 | 236 |
- | IAD77 | 150 |
- | PDX79 | 246 |
WITH cte1 AS (
SELECT NULLIF(date, '-') AS dated,
region,
beverages_units
FROM data4
),
cte2 AS (
SELECT (COALESCE (dated, LAG(dated) OVER ())) AS filled,
region,
beverages_units
FROM cte1),
cte3 AS (
SELECT (COALESCE(filled, LAG(filled) OVER ())) AS filled2,
region,
beverages_units
FROM cte2),
cte4 AS (
SELECT (COALESCE(filled2, LAG(filled2) OVER ())) AS filled3,
region,
beverages_units
FROM cte3)
SELECT *
FROM cte4;
This solved my problem, but is there an easier way of doing this? This was the desired output:
date | region | beverages_units |
---|---|---|
2020-06-01 | SIN51 | 56 |
2020-06-01 | DUB53 | 28 |
2020-06-01 | IAD77 | 83 |
2020-06-01 | PDX79 | 56 |
2020-06-02 | SIN51 | 34 |
2020-06-02 | DUB53 | 0 |
2020-06-02 | IAD77 | 46 |
2020-06-02 | PDX79 | 169 |
2020-06-03 | SIN51 | 41 |
2020-06-03 | DUB53 | 236 |
2020-06-03 | IAD77 | 150 |
2020-06-03 | PDX79 | 246 |
3 Answers
I would first fix your data table as @FrankHeikens commented. Please note the addition of id, since otherwise the order of records is undetermined.
CREATE TABLE the_table (
id integer,
"date" date,
region text,
beverages_units INTEGER
);
INSERT INTO the_table (id, "date", region, beverages_units)
select id, nullif("date", '-')::date, region, beverages_units
from
(
VALUES
(1, '2020-06-01', 'SIN51', 56),
(2, '-', 'DUB53', 28),
(3, '-', 'IAD77', 83),
(4, '-', 'PDX79', 56),
(5, '2020-06-02', 'SIN51', 34),
(6, '-', 'DUB53', 0),
(7, '-', 'IAD77', 46),
(8, '-', 'PDX79', 169),
(9, '2020-06-03', 'SIN51', 41),
(10, '-', 'DUB53', 236),
(11, '-', 'IAD77', 150),
(12, '-', 'PDX79', 246)
)
as t(id, "date", region, beverages_units);
Then use a cursor (maybe suboptimal) in a simple helper function.
create or replace function fix_it()
returns setof the_table language plpgsql as
$$
declare
  r the_table%rowtype;
  v_date date;  -- most recent non-null date seen so far
begin
  -- walk the rows in id order, carrying the last seen date forward
  for r in select * from the_table order by id loop
    if r.date is not null then
      v_date := r.date;   -- remember the new date
    else
      r.date := v_date;   -- fill the gap with the carried date
    end if;
    return next r;
  end loop;
end;
$$;
Finally
select * from fix_it() order by id;
DB-Fiddle demo
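If you also want to write the filled-in dates back to the table rather than only select them, the function's result can be used in an UPDATE. A minimal sketch, assuming id is unique and that only the rows missing a date should be changed:
UPDATE the_table t
SET "date" = f."date"
FROM fix_it() f          -- the helper function defined above
WHERE t.id = f.id
  AND t."date" IS NULL;  -- touch only the rows that were missing a date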
Without a means to order rows deterministically, it's not possible to achieve the OP's desired results. I begin by setting up data corresponding to the OP's example (note that I've added an identity column, renamed date to reference_date to avoid using a reserved keyword as an identifier, and declared reference_date to be of type DATE instead of as a text type).
CREATE TABLE data4 (
id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
reference_date date,
region TEXT,
beverages_units INTEGER
);
-- insert data as per OP's example
INSERT INTO
data4 (id, reference_date, region, beverages_units) OVERRIDING SYSTEM VALUE
SELECT
id,
NULLIF(reference_date, '-')::DATE,
region,
beverages_units
FROM
(
VALUES
(1, '2020-06-01', 'SIN51', 56),
(2, '-', 'DUB53', 28),
(3, '-', 'IAD77', 83),
(4, '-', 'PDX79', 56),
(5, '2020-06-02', 'SIN51', 34),
(6, '-', 'DUB53', 0),
(7, '-', 'IAD77', 46),
(8, '-', 'PDX79', 169),
(9, '2020-06-03', 'SIN51', 41),
(10, '-', 'DUB53', 236),
(11, '-', 'IAD77', 150),
(12, '-', 'PDX79', 246)
) AS t (id, reference_date, region, beverages_units);
-- synchronize the ID generation sequence with the inserted data
SELECT
SETVAL(PG_GET_SERIAL_SEQUENCE('data4', 'id'), MAX(id))
FROM
data4;
The query that the OP claimed solved the problem appears to work; however, it only returned the desired results because the rows happened to be returned in the order in which they were inserted. An easy way to demonstrate this is to update a row after it has been inserted. The OP's proposed query (slightly modified to be consistent with the changes described above) is:
WITH
cte1 AS (
SELECT
reference_date AS dated,
region,
beverages_units
FROM
data4
),
cte2 AS (
SELECT
(COALESCE(dated, LAG(dated) OVER ())) AS filled,
region,
beverages_units
FROM
cte1
),
cte3 AS (
SELECT
(COALESCE(filled, LAG(filled) OVER ())) AS filled2,
region,
beverages_units
FROM
cte2
),
cte4 AS (
SELECT
(COALESCE(filled2, LAG(filled2) OVER ())) AS filled3,
region,
beverages_units
FROM
cte3
)
SELECT
*
FROM
cte4;
Running this query immediately after inserting the data resulted in the following:
filled3 | region | beverages_units |
---|---|---|
2020-06-01 | SIN51 | 56 |
2020-06-01 | DUB53 | 28 |
2020-06-01 | IAD77 | 83 |
2020-06-01 | PDX79 | 56 |
2020-06-02 | SIN51 | 34 |
2020-06-02 | DUB53 | 0 |
2020-06-02 | IAD77 | 46 |
2020-06-02 | PDX79 | 169 |
2020-06-03 | SIN51 | 41 |
2020-06-03 | DUB53 | 236 |
2020-06-03 | IAD77 | 150 |
2020-06-03 | PDX79 | 246 |
After executing this update (which doesn't change any row's data)
UPDATE data4
SET
region = region
WHERE
id = 2;
and then repeating the query, the following was returned (the first group only has three rows and the updated row is no longer associated with a date):
filled3 | region | beverages_units |
---|---|---|
2020-06-01 | SIN51 | 56 |
2020-06-01 | IAD77 | 83 |
2020-06-01 | PDX79 | 56 |
2020-06-02 | SIN51 | 34 |
2020-06-02 | DUB53 | 0 |
2020-06-02 | IAD77 | 46 |
2020-06-02 | PDX79 | 169 |
2020-06-03 | SIN51 | 41 |
2020-06-03 | DUB53 | 236 |
2020-06-03 | IAD77 | 150 |
2020-06-03 | PDX79 | 246 |
NULL | DUB53 | 28 |
The following query depends on a column, id, that can be used to sort the data:
WITH
-- assign groups based on id order and rows with associated dates
grouped AS (
SELECT
id,
reference_date,
SUM(1) FILTER (WHERE reference_date IS NOT NULL) OVER (ORDER BY id) AS group_id,
region,
beverages_units
FROM
data4
)
-- use the same date for all rows within a group
SELECT
id,
MAX(reference_date) OVER (
PARTITION BY
group_id
) AS reference_date,
region,
beverages_units
FROM
grouped
ORDER BY
id;
This query will reliably return the correct results:
id | reference_date | region | beverages_units |
---|---|---|---|
1 | 2020-06-01 | SIN51 | 56 |
2 | 2020-06-01 | DUB53 | 28 |
3 | 2020-06-01 | IAD77 | 83 |
4 | 2020-06-01 | PDX79 | 56 |
5 | 2020-06-02 | SIN51 | 34 |
6 | 2020-06-02 | DUB53 | 0 |
7 | 2020-06-02 | IAD77 | 46 |
8 | 2020-06-02 | PDX79 | 169 |
9 | 2020-06-03 | SIN51 | 41 |
10 | 2020-06-03 | DUB53 | 236 |
11 | 2020-06-03 | IAD77 | 150 |
12 | 2020-06-03 | PDX79 | 246 |
This can be accomplished with a fairly simple query. Basically you have two CTEs: the first identifies the rows having a valid date, and the second takes consecutive dated rows from the first and creates a range of ids covering the rows to be updated. This process does require a unique id. Since you contend the data is sometimes received as-is, this is all carried out in a staging table; once corrected, the data is copied to the permanent table (a sketch of that copy step follows the query below). (See demo here)
with fix_dt(fdt, fid, rn ) as -- cte to identify rows with valid dates
( select dt, id, row_number() over()
from stg_table
where dt is not null
order by id
) -- select * from fix_dt
, rfix(fdt,rid) as -- cte to transform consecutive dated rows to a range of ids
(select fdt
, int8range(fid, lead(fid) over(order by fid), '()')
from fix_dt
) -- select * from rfix;
update stg_table -- apply the date to rows missing it
set dt = fdt
from rfix
where id <@ rid;
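The copy into the permanent table mentioned above is not part of the query. A minimal sketch of that final step, assuming stg_table also carries the region and beverages_units columns and that the permanent table is named final_table (an illustrative name only):
-- sketch only: move the now-dated rows from staging into the permanent table
insert into final_table (dt, region, beverages_units)
select dt, region, beverages_units
from stg_table;

truncate stg_table;  -- optionally clear staging for the next batch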
The above and the demo both assume the use of a staging table; however, I strongly recommend you create an id in your permanent table, even if you do not receive one in the data. In fact, if you do receive one, do not use it for this purpose, as you do not control it. Do not rely on data you do not control to control your process.
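As a rough sketch of that recommendation (perm_table is an illustrative name), an identity column can be added to an existing table; note that the values assigned to rows already present come in no guaranteed order, so it only pins down arrival order for rows inserted from then on:
-- sketch: add a system-managed id to an existing permanent table
alter table perm_table
  add column id bigint generated always as identity;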
'-' is not a valid date. There is no reliable way to translate this to 2020-06-02. Because why should it be 2020-06-02 instead of 2020-06-03? Advice: fix your data. – Frank Heikens Commented Feb 5 at 17:46
Use the DATE type. Avoid using reserved keywords as identifiers (e.g., don't name a column DATE). If order matters, then each row must have a column, or set of columns, that can be used to define a unique ordering. – JohnH Commented Feb 6 at 0:23