
sql - Remove duplicate rows from a table, with more than one row to keep - Stack Overflow


I have two tables. Table posts:

id, -- unique primary key bigserial
post_id, -- integer
[...] -- other columns

And table posts_media:

id, -- unique primary key bigserial
post_db_id, -- bigint, reference to the `id` of table `posts`
post_id, -- integer, same value as `post_id` in the related row in `posts`
[...] -- other columns

Now I have found out that the source (this is for a scraper application) can show the same post more than once, even after a long time, and I ended up with duplicate posts in my database. To prevent this I want to add a UNIQUE constraint on post_id in posts. Deleting the duplicate rows there is easy, but I also want to remove the duplicates in posts_media.
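
For reference, this is roughly the constraint I plan to add once posts is cleaned up (the constraint name is just a placeholder):

-- Placeholder constraint name; added after the duplicate rows in posts have been removed.
ALTER TABLE posts
    ADD CONSTRAINT posts_post_id_unique UNIQUE (post_id);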

The problem is that one post can have multiple posts_media entries. I'm hoping the fact that I also store post_db_id here can save me. In theory I should be able to get all the rows in posts_media with a duplicate post_id, group them by post_db_id, and remove all rows except those with the lowest post_db_id.

Is that correct thinking? And how would I do this in practice? I want to keep the oldest (lowest id or post_db_id) rows.


Example data in posts_media:

|id|post_db_id|post_id|other_columns
|1 |100       |10000  |...
|2 |100       |10000  |...
|3 |110       |10000  |...
|4 |110       |10000  |...
|5 |120       |10000  |...
|6 |120       |10000  |...
|7 |130       |20000  |...
|8 |130       |20000  |...
|9 |140       |20000  |...
|10|140       |20000  |...

With this example data, I want to keep rows with id 1, 2, 7, 8, and remove the rest.


asked Mar 18 at 18:51 by confetti, edited Mar 18 at 19:19
  • I don't see the issue here, you already found the solution. Just make a new unique column of all unique values in post_id, delete post_id and then turn the new column into post_id. You can decide to keep the oldest rows during the creation of the new column – ori raisfeld Commented Mar 18 at 19:06
  • It looks like there is a flaw in your design. Does having the same post_id imply that the rest of the columns are also identical? – Cetin Basoz Commented Mar 18 at 19:08
  • @oriraisfeld I'm not sure if either of us is misunderstanding, but this question is not about removing the duplicate rows in posts, it's about the posts_media table, where there is more than one row with the same post_id that I want to keep. To word it differently: I don't want to remove duplicate rows, rather I want to remove duplicates of multiple rows. All the duplicates will have the same post_id, but I don't want to keep just one of these rows, but all the rows that have the lowest post_db_id (which can be more than one row with the same post_db_id) – confetti Commented Mar 18 at 19:11
  • @CetinBasoz Which table are you talking about? posts_media has other columns, unrelated to this question, which can differ. It's fine for a single post to have multiple media entries. There was a flaw in the posts table allowing duplicates, which I have fixed and added a UNIQUE constraint to make sure it can't happen again. Now I just want to clean up the posts_media table. I think my comment above might've clarified it a bit better. – confetti Commented Mar 18 at 19:14
  • I've edited the question to give an example, I hope that makes it clearer. – confetti Commented Mar 18 at 19:19

3 Answers


Foreign key with an on delete cascade

If your posts_media.post_db_id is a foreign key to posts.id and you declare it with on delete cascade, the corresponding entries are deleted automatically while you purge posts (which is exactly what it was made for: guaranteeing database consistency by removing would-be orphans):

alter table posts_media
add constraint fk_media_post_id foreign key (post_db_id) references posts(id) on delete cascade;

delete from posts where id not in (select min(id) from posts group by post_id);

select * from posts_media;
id post_db_id post_id
1 100 10000
2 100 10000
7 130 20000
8 130 20000

(see it in a fiddle)

DML CTE

You can also rely on PostgreSQL's ability to embed DELETE … RETURNING … in Common Table Expressions,
thus chaining related DELETEs by passing the deleted ids to a second DELETE (or even an UPDATE or whatever you want):

with posts_deleted as
(
    delete from posts
    where id not in (select min(id) from posts group by post_id)
    returning id
)
delete from posts_media pm using posts_deleted pd where pm.post_db_id = pd.id;

(corresponding fiddle)

As with ON DELETE CASCADE, maintainability is easier because you have only one point of decision for choosing which entries to delete: there is no risk of deleting too much or too little in the secondary tables compared to the primary one, since their purge "naturally" (and consistently!) follows.

Moreover, PostgreSQL being well specified and implemented, you get consistently predictable behaviour: DML CTEs always see the tables as they were before the WITH started, with no interference between multiple DMLs in the same WITH, and the DMLs are always executed, even if the final SELECT does not refer to them (while a purely reading CTE might be skipped for optimization reasons).
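
To illustrate that snapshot behaviour (purely an example using the sample posts_media data from the question, not part of the cleanup itself): both the DELETE inside the WITH and the outer SELECT see the table as it was before the statement started, so the SELECT still counts the row the DELETE removes.

with removed as (
    delete from posts_media
    where id = 3          -- example row from the sample data
    returning id
)
select count(*) as visible_in_same_statement
from posts_media
where id = 3;
-- returns 1 here, although the row with id = 3 is gone once the statement completes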

I think you have almost figured it out.

These are the post_db_id values to retain:

SELECT  MIN(post_db_id) AS min_post_db_id
FROM posts_media
GROUP BY post_id

which outputs

100
130

Then DELETE the rows whose post_db_id is not equal to 100 or 130, using NOT IN:

DELETE FROM posts_media
WHERE post_db_id NOT IN
(
    SELECT MIN(post_db_id) AS min_post_db_id
    FROM posts_media
    GROUP BY post_id
);

Fiddle Demo

Output after delete

id post_db_id post_id
1 100 10000
2 100 10000
7 130 20000
8 130 20000

While Samhita's answer essentially does the same thing and is more straightforward, running it on a big table (150M rows) seems to need a lot more resources (to the point where PostgreSQL crashes on my VM), so I would like to present the approach I came up with as well:

WITH dupes AS (
  SELECT id,post_id,post_db_id,
         MIN(post_db_id) OVER (PARTITION BY post_id) AS min_post_db_id
  FROM posts_media
)
DELETE FROM posts_media WHERE id IN (
  SELECT id FROM dupes WHERE post_db_id > min_post_db_id
);

This uses a WITH query to add a min_post_db_id column to each row, containing the smallest post_db_id among all rows with the same post_id. The DELETE then removes every row whose post_db_id is above that lowest value.
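
If you want to sanity-check what would be removed before actually deleting anything, the selecting part of the same statement can be run on its own first (this is just a preview of the rows the DELETE above would target):

WITH dupes AS (
  SELECT id, post_id, post_db_id,
         MIN(post_db_id) OVER (PARTITION BY post_id) AS min_post_db_id
  FROM posts_media
)
SELECT id FROM dupes WHERE post_db_id > min_post_db_id;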
