
sql - Make groups based on the values from two separate columns - Stack Overflow


I have a SQL table with two columns, IP and Agent_ID. Each Agent_ID can have several IPs associated with it, and the same IP should refer to the same user even when the Agent_ID differs. How can I create a new unique identifier in SQL that puts different Agent_IDs sharing an IP into one group, while also keeping rows with the same Agent_ID in the same group? For example:

user data:

IP Agent_ID
192.168.1.1 a
192.168.1.1 a
192.168.2.1 b
192.168.2.2 b
192.168.3.1 c
192.168.3.1 d

Query output:

IP Agent_ID Group
192.168.1.1 a 1
192.168.1.1 a 1
192.168.2.1 b 2
192.168.2.2 b 2
192.168.3.1 c 3
192.168.3.1 d 3
Asked Feb 10 at 14:11 by Charlie Xu. Edited Feb 10 at 18:58 by Dale K.
  • What if there's Agent_ID=e with 192.168.1.2? The first agent b shares neither that IP nor Agent_ID, only the second b does. If you need all three to become one group in this scenario, you'll need iterative evaluation (recursive CTE or connect by) where at the end it's possible everyone turns out to be related transitively, through a chain of indirect links. – Zegarek Commented Feb 10 at 15:00
  • What user are you referring to? There's nothing about a user in your sample data. – Andrew Commented Feb 10 at 18:43
  • @Andrew User traffic analysis based on IP address and agent/client/app/browser fingerprint sounds like a common use case. I believe it's just a slice for the sake of an example and those are probably addresses and web browsers/apps the users connect from. It makes sense to cluster them like this to merge data from the same person connecting from multiple locations, devices and apps. – Zegarek Commented Feb 10 at 19:44

2 Answers


This is similar to the other answer, but a slightly shorter version, written for Snowflake.

First, find the minimum IP for each agent_id. This gives every agent a single representative IP, which is an indirect way of grouping agents that share it.

SELECT ip, agent_id,
       MIN(ip) OVER (PARTITION BY agent_id) AS min_ip
FROM test11;

which generates

IP AGENT_ID MIN_IP
192.168.1.1 a 192.168.1.1
192.168.1.1 a 192.168.1.1
192.168.2.1 b 192.168.2.1
192.168.2.2 b 192.168.2.1
192.168.3.1 c 192.168.3.1
192.168.3.1 d 192.168.3.1

Then rank the rows by min_ip using DENSE_RANK(), which assigns ranks without gaps.

Final Query

WITH groups AS (
    SELECT ip, agent_id,
           MIN(ip) OVER (PARTITION BY agent_id) AS min_ip
    FROM test11
)
SELECT ip, agent_id,
       DENSE_RANK() OVER (ORDER BY min_ip) AS group_id
FROM groups;

Output

IP AGENT_ID GROUP_ID
192.168.1.1 a 1
192.168.1.1 a 1
192.168.2.1 b 2
192.168.2.2 b 2
192.168.3.1 c 3
192.168.3.1 d 3
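
To try the query above without a Snowflake account, here is a minimal sketch that runs it against an in-memory SQLite database from Python. SQLite is only a stand-in (it needs version 3.25 or later for window functions), the CTE is renamed to agent_min to avoid any keyword clash, and the ORDER BY at the end is added only to make the row order deterministic. Note that if ip is stored as text, MIN compares the addresses as strings, not numerically.

```python
import sqlite3

# In-memory SQLite database standing in for the Snowflake table test11.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test11 (ip TEXT, agent_id TEXT)")
conn.executemany(
    "INSERT INTO test11 VALUES (?, ?)",
    [
        ("192.168.1.1", "a"),
        ("192.168.1.1", "a"),
        ("192.168.2.1", "b"),
        ("192.168.2.2", "b"),
        ("192.168.3.1", "c"),
        ("192.168.3.1", "d"),
    ],
)

# Same logic as the answer's final query; agent_min is a renamed CTE
# and the trailing ORDER BY is an addition for a stable row order.
query = """
WITH agent_min AS (
    SELECT ip, agent_id,
           MIN(ip) OVER (PARTITION BY agent_id) AS min_ip
    FROM test11
)
SELECT ip, agent_id,
       DENSE_RANK() OVER (ORDER BY min_ip) AS group_id
FROM agent_min
ORDER BY ip, agent_id
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```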

As far as I understand the question, a user is represented by an agent (Agent_ID), and if another agent has the same IP, it is the same user.
So for each IP we can take min(Agent_ID) over all rows with that IP and call it first_Agent_ID.
This first_Agent_ID is the "unique identifier" of the user.

See example

SELECT *,
       (SELECT MIN(d2.Agent_ID) FROM data d2 WHERE d2.IP = d1.IP) AS first_Agent_ID
FROM data d1;

id IP Agent_ID first_Agent_ID
1 192.168.1.1 a a
2 192.168.1.1 a a
3 192.168.2.1 b b
4 192.168.2.2 b b
5 192.168.3.1 c c
6 192.168.3.1 d c

If you want this identifier to be a number, wrap first_Agent_ID in the rank() or dense_rank() function.

SELECT *,
       DENSE_RANK() OVER (ORDER BY (SELECT MIN(d2.Agent_ID) FROM data d2 WHERE d2.IP = d1.IP)) AS GroupN
FROM data d1;

id IP Agent_ID GroupN
1 192.168.1.1 a 1
2 192.168.1.1 a 1
3 192.168.2.1 b 2
4 192.168.2.2 b 2
5 192.168.3.1 c 3
6 192.168.3.1 d 3

fiddle

I added the id column to the table only to illustrate the example; the queries do not depend on it.

You haven't yet answered the interesting question from @Zegarek. In his example, your table becomes a connectivity graph and requires a recursive solution.
You may need to consider that case.
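
For the transitive case, one way to see what the grouping has to do is a small union-find (disjoint-set) sketch in Python. This is not the recursive CTE the comment mentions, only an illustration of the same connected-components idea outside the database. The function name assign_groups and the two extra rows for agent e are hypothetical and chosen only to create a chain of links.

```python
def assign_groups(rows):
    """Return one group number per (ip, agent_id) row.

    Rows linked by a shared IP or a shared Agent_ID, directly or through
    a chain of such links, receive the same number.
    """
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Each row links its IP node to its agent node.
    for ip, agent_id in rows:
        union(("ip", ip), ("agent", agent_id))

    # Number the components in order of first appearance.
    group_numbers = {}
    result = []
    for ip, _agent_id in rows:
        root = find(("ip", ip))
        if root not in group_numbers:
            group_numbers[root] = len(group_numbers) + 1
        result.append(group_numbers[root])
    return result


sample = [
    ("192.168.1.1", "a"),
    ("192.168.1.1", "a"),
    ("192.168.2.1", "b"),
    ("192.168.2.2", "b"),
    ("192.168.3.1", "c"),
    ("192.168.3.1", "d"),
]
print(assign_groups(sample))  # [1, 1, 2, 2, 3, 3]

# Hypothetical agent e shares one IP with b and another with c and d,
# so those three former groups collapse into a single one.
chained = sample + [("192.168.2.2", "e"), ("192.168.3.1", "e")]
print(assign_groups(chained))  # [1, 1, 2, 2, 2, 2, 2, 2]
```

On the original sample this reproduces the expected groups 1, 2, 3. With the chained rows, neither min(Agent_ID) per IP nor MIN(IP) per agent alone can follow the indirect links, which is why an iterative or recursive evaluation is needed.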
