I have two tables. Table `posts`:

```
id,      -- unique primary key bigserial
post_id, -- integer
[...]    -- other columns
```

And table `posts_media`:

```
id,         -- unique primary key bigserial
post_db_id, -- bigint, reference to the `id` of table `posts`
post_id,    -- integer, same value as `post_id` in the related row in `posts`
[...]       -- other columns
```
Now I found out that the source (this is for a scraper application) can show the same post more than once, even after a long time, and I ended up with duplicate posts in my database. To prevent this I want to add a `UNIQUE` constraint on `post_id` in `posts`. Deleting the duplicate rows there is easy, but I also want to remove the duplicates in `posts_media`.
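For `posts` itself, I assume something like this would work (keeping the lowest `id` per `post_id`; the constraint name is just a placeholder I made up):

```sql
-- Remove duplicate posts, keeping the lowest id per post_id
DELETE FROM posts
WHERE id NOT IN (SELECT MIN(id) FROM posts GROUP BY post_id);

-- Prevent future duplicates (constraint name is a placeholder)
ALTER TABLE posts ADD CONSTRAINT posts_post_id_unique UNIQUE (post_id);
```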
The problem is that one post can have multiple `posts_media` entries. I'm thinking the fact that I also store the `post_db_id` here can save me: in theory I should be able to get all the rows in `posts_media` with a duplicate `post_id`, group those by `post_db_id`, and remove all rows except those with the lowest `post_db_id`.

Is that correct thinking? And how would I do this in practice? I want to keep the oldest (lowest `id` or `post_db_id`) rows.
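For reference, this is roughly the diagnostic query I use to find the affected `post_id`s (just a sketch of my own, not part of the schema):

```sql
-- post_ids that are referenced by more than one distinct posts row
SELECT post_id, COUNT(DISTINCT post_db_id) AS copies
FROM posts_media
GROUP BY post_id
HAVING COUNT(DISTINCT post_db_id) > 1;
```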
Example data in `posts_media`:

| id | post_db_id | post_id | other_columns |
|----|------------|---------|---------------|
| 1  | 100        | 10000   | ...           |
| 2  | 100        | 10000   | ...           |
| 3  | 110        | 10000   | ...           |
| 4  | 110        | 10000   | ...           |
| 5  | 120        | 10000   | ...           |
| 6  | 120        | 10000   | ...           |
| 7  | 130        | 20000   | ...           |
| 8  | 130        | 20000   | ...           |
| 9  | 140        | 20000   | ...           |
| 10 | 140        | 20000   | ...           |

With this example data, I want to keep the rows with id 1, 2, 7, and 8, and remove the rest.
3 Answers

Foreign key with an `on delete cascade`
If your `posts_media.post_db_id` is a foreign key to `posts.id`, the entries will automatically get deleted while you purge `posts`, provided you declare your foreign key with `on delete cascade` (which was made exactly for that: guaranteeing database consistency by removing would-be orphans):
```sql
alter table posts_media
  add constraint fk_media_post_id foreign key (post_db_id) references posts(id) on delete cascade;

delete from posts where id not in (select min(id) from posts group by post_id);

select * from posts_media;
```
| id | post_db_id | post_id |
|----|------------|---------|
| 1  | 100        | 10000   |
| 2  | 100        | 10000   |
| 7  | 130        | 20000   |
| 8  | 130        | 20000   |
(see it in a fiddle)
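If a foreign key on `post_db_id` already exists but without the cascade, you can replace it first (a sketch; the constraint name below is an assumption, check yours with `\d posts_media`):

```sql
-- drop the existing constraint (its name is an assumption here),
-- then re-create it with the cascade
alter table posts_media drop constraint if exists fk_media_post_id;
alter table posts_media
  add constraint fk_media_post_id foreign key (post_db_id) references posts(id) on delete cascade;
```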
DML CTE
You can also rely on PostgreSQL's ability to embed `DELETE … RETURNING …` in Common Table Expressions, thus chaining related `DELETE`s by passing the deleted `id`s to a second `DELETE` (or even an `UPDATE`, or whatever you want):
```sql
with posts_deleted as
(
    delete from posts
    where id not in (select min(id) from posts group by post_id)
    returning id
)
delete from posts_media pm using posts_deleted pd where pm.post_db_id = pd.id;
```
(corresponding fiddle)
As with the `ON DELETE CASCADE`, maintainability is eased by the fact that you have only one point of decision for choosing which entries to delete: no risk of deleting too much or too little in the secondary tables compared to the primary one; their purge "naturally" (and consistently!) follows.

Moreover, PostgreSQL being well specified and implemented, you get consistently predictable behaviour: DML CTEs always see the tables as they were before the `WITH` started, with no interference between multiple DMLs in the same `WITH`; and DMLs are always executed, even if the final statement does not refer to them (while a purely read-only CTE might be skipped for optimization reasons).
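To illustrate that last point, here is a sketch chaining several DMLs in one statement; the `cleanup_log` audit table is purely hypothetical:

```sql
-- every DML here sees the tables as they were before the statement began;
-- cleanup_log (with a media_id column) is a hypothetical audit table
with posts_deleted as
(
    delete from posts
    where id not in (select min(id) from posts group by post_id)
    returning id
),
media_deleted as
(
    delete from posts_media pm
    using posts_deleted pd
    where pm.post_db_id = pd.id
    returning pm.id
)
insert into cleanup_log (media_id)
select id from media_deleted;
```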
I think you have almost figured it out.

These are the IDs to retain:
```sql
SELECT MIN(post_db_id) AS min_post_db_id
FROM posts_media
GROUP BY post_id
```
which outputs

```
100
130
```
And then `DELETE` the rows whose `post_db_id` is not equal to 100 or 130, using `NOT IN`:
```sql
DELETE FROM posts_media
WHERE post_db_id NOT IN
(
    SELECT MIN(post_db_id) AS min_post_db_id
    FROM posts_media
    GROUP BY post_id
);
```
Fiddle Demo
Output after delete:

| id | post_db_id | post_id |
|----|------------|---------|
| 1  | 100        | 10000   |
| 2  | 100        | 10000   |
| 7  | 130        | 20000   |
| 8  | 130        | 20000   |
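If you want to inspect the result before it becomes permanent, one option (my suggestion, not required) is to wrap the delete in a transaction:

```sql
BEGIN;

DELETE FROM posts_media
WHERE post_db_id NOT IN
(
    SELECT MIN(post_db_id)
    FROM posts_media
    GROUP BY post_id
);

SELECT * FROM posts_media ORDER BY id;  -- inspect the survivors

COMMIT;  -- or ROLLBACK; if the result looks wrong
```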
While Samhita's answer is essentially doing the same thing and is more straightforward, running it on a big table (150M rows) seems to need a lot more resources (to the point where PostgreSQL crashes on my VM), so I would like to present the approach I came up with as well:
```sql
WITH dupes AS (
    SELECT id, post_id, post_db_id,
           MIN(post_db_id) OVER (PARTITION BY post_id) AS min_post_db_id
    FROM posts_media
)
DELETE FROM posts_media WHERE id IN (
    SELECT id FROM dupes WHERE post_db_id > min_post_db_id
);
```
This uses a `WITH` query to add a column `min_post_db_id` to each row, containing the smallest `post_db_id` among all rows with the same `post_id`. The `DELETE` query then removes every row whose `post_db_id` is above that lowest available `post_db_id` value.
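On very large tables you might additionally want to delete in batches so each transaction stays small. A sketch under the same keep-the-lowest rule (the batch size of 10000 is arbitrary); run it repeatedly until it reports 0 deleted rows:

```sql
-- delete up to 10000 duplicate rows per run; repeat until no rows are deleted
DELETE FROM posts_media WHERE id IN (
    SELECT pm.id
    FROM posts_media pm
    JOIN (
        SELECT post_id, MIN(post_db_id) AS min_post_db_id
        FROM posts_media
        GROUP BY post_id
    ) k ON k.post_id = pm.post_id
    WHERE pm.post_db_id > k.min_post_db_id
    LIMIT 10000
);
```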
Comments:

[…] `posts`, it's about the `posts_media` table, where there is more than one row with the same `post_id` that I want to keep. To word it differently: I don't want to remove duplicate rows, rather I want to remove duplicates of multiple rows. All the duplicates will have the same `post_id`, but I don't want to keep just one of these rows, but all the rows that have the lowest `post_db_id` (which can be more than one row, with the same `post_db_id`). – confetti, Mar 18 at 19:11

`posts_media` has other columns, unrelated to this question, which can differ. It's fine for a single post to have multiple media entries. There was a flaw in the `posts` table allowing duplicates, which I have fixed and added a `UNIQUE` constraint to make sure it can't happen again. Now I just want to clean up the `posts_media` table. I think my comment above might've clarified it a bit better. – confetti, Mar 18 at 19:14