最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

sql - MySQL query to avoid displaying duplicate records having same user_id, user_email, user_ip - Stack Overflow

programmeradmin1浏览0评论

I am not sure how to achieve this by a MySQL query:

I have a list of ads posted by multiple users and some users post multiple ads (e.g 10-20), when I load a list of ads (e.g sorted by ad_date DESC) I can run into a page full of duplicate spammy ads from same user.

Normally I can load them into a PHP array and loop over them to check user_id, user_email, user_ip, user_phone then only print one ad matching that detail and skip the rest, but this solution causes problems when visitor clicks to go to page 2, when I don't know on what ad nubmer to start, because I need to track how many ads I skipped in previous page, so that solution is not working.

How can I do something like this ?

SELECT * from ads but just one by each user_id, one by each user_email, one by each user_phone ORDER BY ad_date DESC;

Another solution I see would be to load all ads at once and loop over them all, then remove duplicates and decide what ads to print on each page number, but there are around 200 000 records and PHP page says it uses around 200 MB to load all ads, so that aproach is not working eider.

Edit: I am providing sample input and output:

So the expected behaviour is not to list all unique emails or phones, but rather identify that an ad by same person was already printed once, identified by eider one of the fileds (email, phone, user ip address)

// I removed user_id here because that causes some confusion

sample input

| ad_id | user_email        | user_phone   | ad_description | ad_date    | comment 
|-------|-------------------|--------------|----------------|------------|---------
| 1     |                   | 123-456-7891 | ad_desc1       | 2023-01-01 |
| 2     | [email protected] | 123-456-7891 | ad_desc2       | 2023-01-02 | skip because PHONE listed before
| 3     | [email protected] | 123-456-7892 | ad_desc3       | 2023-01-03 | skip because EMAIL listed before
| 4     | [email protected] | 123-456-7892 | ad_desc4       | 2023-02-01 | skip because PHONE listed before
| 5     | [email protected] |              | ad_desc5       | 2023-02-02 | skip because EMAIL listed before
| 6     | [email protected] | 123-456-7893 | ad_desc6       | 2023-03-01 |
| 7     |                   | 123-456-7892 | ad_desc7       | 2023-03-02 | skip because PHONE listed before
| 8     | [email protected] | 123-456-7895 | ad_desc8       | 2023-03-03 | skip because EMAIL listed before
| 9     | [email protected] |              | ad_desc9       | 2023-03-02 | skip because EMAIL listed before
| 10    |                   | 123-456-7899 | ad_desc10      | 2023-03-02 |

expected output

| ad_id | user_email        | user_phone   | ad_description | ad_date    |
|-------|-------------------|--------------|----------------|------------|
| 1     |                   | 123-456-7891 | ad_desc1       | 2023-01-01 |
| 6     | [email protected] | 123-456-7893 | ad_desc6       | 2023-03-01 |
| 10    |                   | 123-456-7899 | ad_desc10      | 2023-03-02 |

I am not sure how to achieve this by a MySQL query:

I have a list of ads posted by multiple users and some users post multiple ads (e.g 10-20), when I load a list of ads (e.g sorted by ad_date DESC) I can run into a page full of duplicate spammy ads from same user.

Normally I can load them into a PHP array and loop over them to check user_id, user_email, user_ip, user_phone then only print one ad matching that detail and skip the rest, but this solution causes problems when visitor clicks to go to page 2, when I don't know on what ad nubmer to start, because I need to track how many ads I skipped in previous page, so that solution is not working.

How can I do something like this ?

SELECT * from ads but just one by each user_id, one by each user_email, one by each user_phone ORDER BY ad_date DESC;

Another solution I see would be to load all ads at once and loop over them all, then remove duplicates and decide what ads to print on each page number, but there are around 200 000 records and PHP page says it uses around 200 MB to load all ads, so that aproach is not working eider.

Edit: I am providing sample input and output:

So the expected behaviour is not to list all unique emails or phones, but rather identify that an ad by same person was already printed once, identified by eider one of the fileds (email, phone, user ip address)

// I removed user_id here because that causes some confusion

sample input

| ad_id | user_email        | user_phone   | ad_description | ad_date    | comment 
|-------|-------------------|--------------|----------------|------------|---------
| 1     |                   | 123-456-7891 | ad_desc1       | 2023-01-01 |
| 2     | [email protected] | 123-456-7891 | ad_desc2       | 2023-01-02 | skip because PHONE listed before
| 3     | [email protected] | 123-456-7892 | ad_desc3       | 2023-01-03 | skip because EMAIL listed before
| 4     | [email protected] | 123-456-7892 | ad_desc4       | 2023-02-01 | skip because PHONE listed before
| 5     | [email protected] |              | ad_desc5       | 2023-02-02 | skip because EMAIL listed before
| 6     | [email protected] | 123-456-7893 | ad_desc6       | 2023-03-01 |
| 7     |                   | 123-456-7892 | ad_desc7       | 2023-03-02 | skip because PHONE listed before
| 8     | [email protected] | 123-456-7895 | ad_desc8       | 2023-03-03 | skip because EMAIL listed before
| 9     | [email protected] |              | ad_desc9       | 2023-03-02 | skip because EMAIL listed before
| 10    |                   | 123-456-7899 | ad_desc10      | 2023-03-02 |

expected output

| ad_id | user_email        | user_phone   | ad_description | ad_date    |
|-------|-------------------|--------------|----------------|------------|
| 1     |                   | 123-456-7891 | ad_desc1       | 2023-01-01 |
| 6     | [email protected] | 123-456-7893 | ad_desc6       | 2023-03-01 |
| 10    |                   | 123-456-7899 | ad_desc10      | 2023-03-02 |
Share Improve this question edited Jan 20 at 16:45 adrianTNT asked Jan 20 at 0:16 adrianTNTadrianTNT 4,0825 gold badges35 silver badges40 bronze badges 5
  • group BY user_id, user_email, user_ip and instead of Select * use an aggregation function for everyay column that is not in the group by – nbk Commented Jan 20 at 0:19
  • I am not familiar with aggregation functions – adrianTNT Commented Jan 20 at 1:02
  • one by each user_id, one by each user_email, one by each user_phone ??? Are you sure that you need in this and not in "one by each (user_id, user_email, user_phone) combination"? And - well, you solve this, you have one row from the duplicates bunch, you have definite values for listed columns and indefinitely/randomly selected values in all another columns.. do you need in such indefinite values really? PS. What is MySQL version precisely? – Akina Commented Jan 20 at 4:39
  • Mysql version is 5.7.44 one by ... not sure how to explain correctly, regardless of the field, I don't want to print another record if one was printed above (with eider same email, phone, etc), some user have same email+phone, some have no email but same phone. I found some queries to select individual ones, but after I filter these duplicates I actually need to load full row data (ad_id, ad_descrption), not just these problematic fields. – adrianTNT Commented Jan 20 at 11:39
  • you have neither user_id nor userip in yoiu sample data, add it else the query is hard to gette result you expect – nbk Commented Jan 20 at 22:25
Add a comment  | 

1 Answer 1

Reset to default 1

EDIT : Considered ad_description and ad_id as unique as per comment while user_id , user_email and user_phone are not unique as per comment.

Simple way would be to what is mentioned in comments to use group by and choose the max or min ad_date per user_id,user_email, user_phone etc

SELECT user_id, user_email, user_phone , max(ad_date) latest_ad_date
  from ads 
  group by user_id, user_email, user_phone ;

which for sample data

| user_id | user_email        | user_phone   | ad_date    |
|---------|-------------------|--------------|------------|
| 1       | [email protected] | 123-456-7890 | 2023-01-01 |
| 1       | [email protected] | 123-456-7890 | 2023-02-01 |
| 1       | [email protected] | 123-456-7890 | 2023-03-01 |
| 2       | [email protected] | 123-456-7891 | 2023-01-15 |
| 2       | [email protected] | 123-456-7891 | 2023-02-15 |
| 3       | [email protected] | 123-456-7892 | 2023-01-20 |
| 3       | [email protected] | 123-456-7892 | 2023-02-20 |
| 3       | [email protected] | 123-456-7892 | 2023-03-20 |

generates

| user_id | user_email        | user_phone   | latest_ad_date |
|---------|-------------------|--------------|----------------|
| 1       | [email protected] | 123-456-7890 | 2023-03-01     |
| 2       | [email protected] | 123-456-7891 | 2023-02-15     |
| 3       | [email protected] | 123-456-7892 | 2023-03-20     |

If you already have parameter for user_id, you can use a filter(where) and a limit to use DESC like you mentioned in the question

select user_id, user_email, user_phone , ad_date from ads 
  where user_id = 1
  order by ad_date desc
  limit 1;

generates

| user_id | user_email        | user_phone   | ad_date    |
|---------|-------------------|--------------|------------|
| 1       | [email protected] | 123-456-7890 | 2023-03-01 |

Another option is to rank each user based on user_id and order of ad_date

WITH rank_ads AS (
    SELECT  user_id, user_email, user_phone, ad_date,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ad_date DESC) AS rn
    FROM ads
)
SELECT
user_id, user_email, user_phone, ad_date    
FROM  rank_ads
WHERE rn = 1
ORDER BY  user_id ;

generates

| user_id | user_email        | user_phone   | ad_date    |
|---------|-------------------|--------------|------------|
| 1       | [email protected] | 123-456-7890 | 2023-03-01 |
| 2       | [email protected] | 123-456-7891 | 2023-02-15 |
| 3       | [email protected] | 123-456-7892 | 2023-03-20 |

EDIT as per comment

Basis my understanding, I created few sample data, note ad_desc and ad_id are unique while user_id, user_email, user_phone are not unique.

| ad_id | user_id | user_email        | user_phone   | ad_description | ad_date    |
|-------|---------|-------------------|--------------|----------------|------------|
| 1     | 1       | [email protected] | 123-456-7890 | ad_desc1       | 2023-01-01 |
| 2     | 1       | [email protected] | 123-456-7890 | ad_desc2       | 2023-01-02 |
| 3     | 1       | [email protected] | 123-456-7890 | ad_desc3       | 2023-01-03 |
| 4     | 2       | [email protected] | 123-456-7891 | ad_desc4       | 2023-02-01 |
| 5     | 2       | [email protected] | 123-456-7891 | ad_desc5       | 2023-02-02 |
| 6     | 3       | [email protected] | 123-456-7892 | ad_desc6       | 2023-03-01 |
| 7     | 3       | [email protected] | 123-456-7892 | ad_desc7       | 2023-03-02 |
| 8     | 3       | [email protected] | 123-456-7892 | ad_desc8       | 2023-03-03 |

In this case if you cant directly use group by because all the rows for each unique combination(user_id+user_email+user_phone+user_description) will be returned.

SELECT ad_id,user_id, user_email, user_phone ,ad_description , max(ad_date) latest_ad_date
  from ads 
  group by ad_id, user_id, user_email, user_phone,ad_description ;

generates

| ad_id | user_id | user_email        | user_phone   | ad_description | latest_ad_date |
|-------|---------|-------------------|--------------|----------------|----------------|
| 1     | 1       | [email protected] | 123-456-7890 | ad_desc1       | 2023-01-01     |
| 2     | 1       | [email protected] | 123-456-7890 | ad_desc2       | 2023-01-02     |
| 3     | 1       | [email protected] | 123-456-7890 | ad_desc3       | 2023-01-03     |
| 4     | 2       | [email protected] | 123-456-7891 | ad_desc4       | 2023-02-01     |
| 5     | 2       | [email protected] | 123-456-7891 | ad_desc5       | 2023-02-02     |
| 6     | 3       | [email protected] | 123-456-7892 | ad_desc6       | 2023-03-01     |
| 7     | 3       | [email protected] | 123-456-7892 | ad_desc7       | 2023-03-02     |
| 8     | 3       | [email protected] | 123-456-7892 | ad_desc8       | 2023-03-03     |

I guess this is not what you want, you can use window function for this

WITH rank_ads AS (
    SELECT  ad_id, user_id, user_email, user_phone, ad_description , ad_date,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ad_date DESC) AS rn
    FROM ads
)
SELECT
ad_id , user_id, user_email, user_phone, ad_description ,ad_date    
FROM  rank_ads
WHERE rn = 1
ORDER BY  user_id ;

generates

| ad_id | user_id | user_email        | user_phone   | ad_description | ad_date    |
|-------|---------|-------------------|--------------|----------------|------------|
| 3     | 1       | [email protected] | 123-456-7890 | ad_desc3       | 2023-01-03 |
| 5     | 2       | [email protected] | 123-456-7891 | ad_desc5       | 2023-02-02 |
| 8     | 3       | [email protected] | 123-456-7892 | ad_desc8       | 2023-03-03 |

Demo

与本文相关的文章

发布评论

评论列表(0)

  1. 暂无评论