最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

AWS redshift range join slow - Stack Overflow

programmeradmin3浏览0评论

I want to identify the location (country, city, area, latitude, longitude) details of clients who visit my site based on the client IP address for my analysis. We have IP address data with details of range of IP (fromIP,toIP) for each of the location. I will be running a job every day overnight to identify and update the location details for that day in my table which stores the client visit details.

Below is my table structure

CREATE TABLE CLIENT_VISIT
(VISIT_ID VARCHAR(65535), 
CLIENT_ID INTEGER,
CLIENTIP_VALUE BIGINT, 
VISIT_DATE TIMESTAMP WITHOUT TIME ZONE,
COUNTRY VARCHAR(180), 
CITY VARCHAR(180),
AREA VARCHAR(180), 
LATITUDE VARCHAR(60), 
LONGITUDE VARCHAR(60))
SORTKEY (   VISIT_DATE  );
    
CREATE TABLE IP_ADDRESS_DETAILS
(
FROM_IP_VALUE BIGINT,
TO_IP_VALUE BIGINT,
COUNTRY VARCHAR(180), 
CITY VARCHAR(180),
AREA VARCHAR(180), 
LATITUDE VARCHAR(60), 
LONGITUDE VARCHAR(60)
)SORTKEY (
    FROM_IP_VALUE,TO_IP_VALUE
    );

Below is my query

    SELECT cv.visit_id,cv.client_id ,ida.country,ida.city,ida.area,
        ida.latitude,ida.longitude
    FROM   client_visit cv,ip_address_details ida
    WHERE  cv.visit_date  >= trunc(sysdate-2)
    AND cv.visit_date  < trunc(sysdate-1)
    AND (cv.clientip_value >=  ida.from_ip_value) 
    AND (cv.clientip_value < ida.to_ip_value)

The table client_visit contains about 50 million rows. The table ip_address_details contains about 4 million rows.

Stand alone when i run the query on client_visit with the visit_date filter alone i get the result quickly. Result is about 100 thousand rows. But on joining with ip_address_details the query hangs and does not return even after 1hr.

I have tried with DISTSTYLE KEY and DISTKEY (to_ip_value) on ip_address_details table.

I have also tried creating Primary and foreign key on tables ip_address_details and client_visit on the IP address column to influence the query execution path. I have tried variety of other options of re-writing the query modifying the DISTSTYLE and DISTKEY.

Still when i try to identify the location details my query hangs. The plan always shows nested loop join and seq scan. I am relatively new to redshift. Any ideas/suggestions on how this can be improved. Any pointers on this will be of great help.

Below is the plan

XN Nested Loop DS_BCAST_INNER  (cost=70225.77..714363120.75 rows=53406584 width=606)
  Join Filter: (("inner".clientip_value < "outer".to_ip_value) AND ("inner".clientip_value >= "outer".from_ip_value))
  ->  XN Seq Scan on ip_address_details ida  (cost=0.00..36413.58 rows=3641358 width=98)
  ->  XN Materialize  (cost=70225.77..70227.09 rows=132 width=524)
        ->  XN Seq Scan on client_visit cv  (cost=0.00..1.64 rows=132 width=524)
                      Filter: ((process_time >= '2025-01-31 00:00:00'::timestamp without time zone) AND (process_time < '2025-02-01 00:00:00'::timestamp without time zone))
----- Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products -----

I have tried with creating primary key on IP_ADDRESS_DETAILS.TO_IP_VALUE and foreign key on CLIENT_VISIT.CLIENTIP_VALUE referencing the TO_IP_VALUE also. From the data i know below condition will get max of 1 row

AND (cv.clientip_value >=  ida.from_ip_value) 
AND (cv.clientip_value < ida.to_ip_value)

I want to identify the location (country, city, area, latitude, longitude) details of clients who visit my site based on the client IP address for my analysis. We have IP address data with details of range of IP (fromIP,toIP) for each of the location. I will be running a job every day overnight to identify and update the location details for that day in my table which stores the client visit details.

Below is my table structure

CREATE TABLE CLIENT_VISIT
(VISIT_ID VARCHAR(65535), 
CLIENT_ID INTEGER,
CLIENTIP_VALUE BIGINT, 
VISIT_DATE TIMESTAMP WITHOUT TIME ZONE,
COUNTRY VARCHAR(180), 
CITY VARCHAR(180),
AREA VARCHAR(180), 
LATITUDE VARCHAR(60), 
LONGITUDE VARCHAR(60))
SORTKEY (   VISIT_DATE  );
    
CREATE TABLE IP_ADDRESS_DETAILS
(
FROM_IP_VALUE BIGINT,
TO_IP_VALUE BIGINT,
COUNTRY VARCHAR(180), 
CITY VARCHAR(180),
AREA VARCHAR(180), 
LATITUDE VARCHAR(60), 
LONGITUDE VARCHAR(60)
)SORTKEY (
    FROM_IP_VALUE,TO_IP_VALUE
    );

Below is my query

    SELECT cv.visit_id,cv.client_id ,ida.country,ida.city,ida.area,
        ida.latitude,ida.longitude
    FROM   client_visit cv,ip_address_details ida
    WHERE  cv.visit_date  >= trunc(sysdate-2)
    AND cv.visit_date  < trunc(sysdate-1)
    AND (cv.clientip_value >=  ida.from_ip_value) 
    AND (cv.clientip_value < ida.to_ip_value)

The table client_visit contains about 50 million rows. The table ip_address_details contains about 4 million rows.

Stand alone when i run the query on client_visit with the visit_date filter alone i get the result quickly. Result is about 100 thousand rows. But on joining with ip_address_details the query hangs and does not return even after 1hr.

I have tried with DISTSTYLE KEY and DISTKEY (to_ip_value) on ip_address_details table.

I have also tried creating Primary and foreign key on tables ip_address_details and client_visit on the IP address column to influence the query execution path. I have tried variety of other options of re-writing the query modifying the DISTSTYLE and DISTKEY.

Still when i try to identify the location details my query hangs. The plan always shows nested loop join and seq scan. I am relatively new to redshift. Any ideas/suggestions on how this can be improved. Any pointers on this will be of great help.

Below is the plan

XN Nested Loop DS_BCAST_INNER  (cost=70225.77..714363120.75 rows=53406584 width=606)
  Join Filter: (("inner".clientip_value < "outer".to_ip_value) AND ("inner".clientip_value >= "outer".from_ip_value))
  ->  XN Seq Scan on ip_address_details ida  (cost=0.00..36413.58 rows=3641358 width=98)
  ->  XN Materialize  (cost=70225.77..70227.09 rows=132 width=524)
        ->  XN Seq Scan on client_visit cv  (cost=0.00..1.64 rows=132 width=524)
                      Filter: ((process_time >= '2025-01-31 00:00:00'::timestamp without time zone) AND (process_time < '2025-02-01 00:00:00'::timestamp without time zone))
----- Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products -----

I have tried with creating primary key on IP_ADDRESS_DETAILS.TO_IP_VALUE and foreign key on CLIENT_VISIT.CLIENTIP_VALUE referencing the TO_IP_VALUE also. From the data i know below condition will get max of 1 row

AND (cv.clientip_value >=  ida.from_ip_value) 
AND (cv.clientip_value < ida.to_ip_value)
Share Improve this question edited Feb 15 at 10:51 Balaji Govindan asked Feb 14 at 18:33 Balaji GovindanBalaji Govindan 114 bronze badges 2
  • MIght want to read this. docs.aws.amazon/redshift/latest/dg/c-the-query-plan.html Show us the exact indexes you have. – O. Jones Commented Feb 14 at 18:38
  • The ranged join between the two tables wouldn't be very efficient. Perhaps you could flatten it to have one entry per IP address (4 billion rows) -- this might sound like a lot, but it would be much more efficient for joining. Or, use a service like Google Analytics to display your location information rather than trying to calculate it yourself – John Rotenstein Commented Feb 16 at 23:05
Add a comment  | 

1 Answer 1

Reset to default 0

It looks very likely that you are creating trillion (or more) of rows in this JOIN operation. Creating the manipulating this amount of data will be very slow as it likely will cause spill. Run EXPLAIN on your query and look at the number of rows (n) that it estimates this JOIN will create. I expect you need to qualify this JOIN further.

发布评论

评论列表(0)

  1. 暂无评论