最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

query optimization - Why does adding OR to MySQL statement with WHERE IN causes a full table scan? - Stack Overflow

programmeradmin2浏览0评论

We are using MySql 8.0.36. After running into a performance issue where a developer added an OR WHERE IN (subquery) statement to one of our queries, I started to investigate what combination of statements triggers bad performance. I was surprised to learn that adding OR to a query that has WHERE IN significantly decreases performance, even if it is constant time.

Take for example this query. There's a session table (39 million rows), and a presenters table (32 million rows) that has a non-unique foreign key to sessions:

EXPLAIN ANALYZE
select *
from `sessions`
where `sessions`.`session_id` IN (
  select `presenters`.`session_id`
  from `presenters`
  where `presenters`.`user_id` = 71
);

/*
-> Nested loop inner join  (cost=1.45 rows=1) (actual time=0.0489..0.051 rows=1 loops=1)
    -> Index lookup on presenters using presenters_user_id_foreign (user_id=71)  (cost=1.1 rows=1) (actual time=0.0301..0.032 rows=1 loops=1)
    -> Single-row index lookup on sessions using PRIMARY (session_id=presenters.session_id)  (cost=0.35 rows=1) (actual time=0.0171..0.0171 rows=1 loops=1)
*/

Now take a look what happens when I add an OR 1=0:

EXPLAIN ANALYZE
select *
from `sessions`
where `sessions`.`session_id` IN (
  select `presenters`.`session_id`
  from `presenters`
  where `presenters`.`user_id` = 71
)
OR 1=0; -- OR false also causes the execution plan change

/*
-> Filter: <in_optimizer>(sessions.session_id,sessions.session_id in (select #2))  (cost=3.27e+6 rows=31.9e+6) (actual time=0.0549..78796 rows=1 loops=1)
    -> Table scan on sessions  (cost=3.27e+6 rows=31.9e+6) (actual time=0.0252..52601 rows=34e+6 loops=1)
    -> Select #2 (subquery in condition; run only once)
        -> Filter: ((sessions.session_id = `<materialized_subquery>`.session_id))  (cost=1.3..1.3 rows=1) (actual time=522e-6..522e-6 rows=29.4e-9 loops=34e+6)
            -> Limit: 1 row(s)  (cost=1.2..1.2 rows=1) (actual time=419e-6..419e-6 rows=29.4e-9 loops=34e+6)
                -> Index lookup on <materialized_subquery> using <auto_distinct_key> (session_id=sessions.session_id)  (actual time=308e-6..308e-6 rows=29.4e-9 loops=34e+6)
                    -> Materialize with deduplication  (cost=1.2..1.2 rows=1) (actual time=0.0224..0.0224 rows=1 loops=1)
                        -> Index lookup on presenters using presenters_user_id_foreign (user_id=71)  (cost=1.1 rows=1) (actual time=0.0161..0.0177 rows=1 loops=1)
*/

Note that this does a full table scan. The first query ran in less than a millisecond, this one took a little over a minute. At this point we have to ban any use of OR combined with WHERE IN.

Is there any reason why this is happening? Is there anything we can do to work around this aside from making two separate queries and unioning the results?

We are using MySql 8.0.36. After running into a performance issue where a developer added an OR WHERE IN (subquery) statement to one of our queries, I started to investigate what combination of statements triggers bad performance. I was surprised to learn that adding OR to a query that has WHERE IN significantly decreases performance, even if it is constant time.

Take for example this query. There's a session table (39 million rows), and a presenters table (32 million rows) that has a non-unique foreign key to sessions:

EXPLAIN ANALYZE
select *
from `sessions`
where `sessions`.`session_id` IN (
  select `presenters`.`session_id`
  from `presenters`
  where `presenters`.`user_id` = 71
);

/*
-> Nested loop inner join  (cost=1.45 rows=1) (actual time=0.0489..0.051 rows=1 loops=1)
    -> Index lookup on presenters using presenters_user_id_foreign (user_id=71)  (cost=1.1 rows=1) (actual time=0.0301..0.032 rows=1 loops=1)
    -> Single-row index lookup on sessions using PRIMARY (session_id=presenters.session_id)  (cost=0.35 rows=1) (actual time=0.0171..0.0171 rows=1 loops=1)
*/

Now take a look what happens when I add an OR 1=0:

EXPLAIN ANALYZE
select *
from `sessions`
where `sessions`.`session_id` IN (
  select `presenters`.`session_id`
  from `presenters`
  where `presenters`.`user_id` = 71
)
OR 1=0; -- OR false also causes the execution plan change

/*
-> Filter: <in_optimizer>(sessions.session_id,sessions.session_id in (select #2))  (cost=3.27e+6 rows=31.9e+6) (actual time=0.0549..78796 rows=1 loops=1)
    -> Table scan on sessions  (cost=3.27e+6 rows=31.9e+6) (actual time=0.0252..52601 rows=34e+6 loops=1)
    -> Select #2 (subquery in condition; run only once)
        -> Filter: ((sessions.session_id = `<materialized_subquery>`.session_id))  (cost=1.3..1.3 rows=1) (actual time=522e-6..522e-6 rows=29.4e-9 loops=34e+6)
            -> Limit: 1 row(s)  (cost=1.2..1.2 rows=1) (actual time=419e-6..419e-6 rows=29.4e-9 loops=34e+6)
                -> Index lookup on <materialized_subquery> using <auto_distinct_key> (session_id=sessions.session_id)  (actual time=308e-6..308e-6 rows=29.4e-9 loops=34e+6)
                    -> Materialize with deduplication  (cost=1.2..1.2 rows=1) (actual time=0.0224..0.0224 rows=1 loops=1)
                        -> Index lookup on presenters using presenters_user_id_foreign (user_id=71)  (cost=1.1 rows=1) (actual time=0.0161..0.0177 rows=1 loops=1)
*/

Note that this does a full table scan. The first query ran in less than a millisecond, this one took a little over a minute. At this point we have to ban any use of OR combined with WHERE IN.

Is there any reason why this is happening? Is there anything we can do to work around this aside from making two separate queries and unioning the results?

Share Improve this question edited Nov 18, 2024 at 17:17 jgawrych asked Nov 17, 2024 at 5:30 jgawrychjgawrych 3,5411 gold badge31 silver badges41 bronze badges 7
  • If you analyze how an index (usually a B-tree) is used, you will see that WHERE IN (...) renders the index not very usable or beneficial. – Tim Biegeleisen Commented Nov 17, 2024 at 5:32
  • 1 As far as I know the optimizer translates exists and in subqueries into joins when suited(see the first explain). Finding an or seems to rule this out(combining a subquery and an or to a join is not straightforward...) and the two queries strategy is taken. – Turo Commented Nov 17, 2024 at 8:24
  • Curiosity killed the cat: Did you try what happens when you put the or in the subquery? – Turo Commented Nov 17, 2024 at 8:38
  • @Turo - EXISTS is a "semijoin"; it works somewhat like a JOIN, but can stop when one row matches. IN is often optimized more poorly than the equivalent. – Rick James Commented Nov 17, 2024 at 16:11
  • @Turo for curiosity's sake, moving the OR 1=0 into the subquery did not cause the optimizer to change plans. Interestingly, I then tried converting the query into an EXISTS and it didn't change either. However, adding back my real condition, caused a completely different plan, but it's worse than ever, joining two full table index scans – jgawrych Commented Nov 18, 2024 at 17:32
 |  Show 2 more comments

1 Answer 1

Reset to default 2
  • When building a WHERE clause, avoid unnecessary things. (I, too, am surprised that the Optimizer failed to see that as OR FALSE and remove it.)

  • Write the query as a simple JOIN; the Optimizer is better at optimizing JOIN than IN ( SELECT ... ):

    SELECT s.*
        FROM presenters AS p
        JOIN sessions AS s  USING (session_id)
        WHERE p.user_id = 71
    
  • Have INDEX(session_id) on `sessions.

  • File a bug with bugs.mysql if you feel strongly enough about this issue.

与本文相关的文章

发布评论

评论列表(0)

  1. 暂无评论