We are using MySql 8.0.36. After running into a performance issue where a developer added an OR WHERE IN (subquery)
statement to one of our queries, I started to investigate what combination of statements triggers bad performance. I was surprised to learn that adding OR
to a query that has WHERE IN
significantly decreases performance, even if it is constant time.
Take for example this query. There's a session table (39 million rows), and a presenters table (32 million rows) that has a non-unique foreign key to sessions:
EXPLAIN ANALYZE
select *
from `sessions`
where `sessions`.`session_id` IN (
select `presenters`.`session_id`
from `presenters`
where `presenters`.`user_id` = 71
);
/*
-> Nested loop inner join (cost=1.45 rows=1) (actual time=0.0489..0.051 rows=1 loops=1)
-> Index lookup on presenters using presenters_user_id_foreign (user_id=71) (cost=1.1 rows=1) (actual time=0.0301..0.032 rows=1 loops=1)
-> Single-row index lookup on sessions using PRIMARY (session_id=presenters.session_id) (cost=0.35 rows=1) (actual time=0.0171..0.0171 rows=1 loops=1)
*/
Now take a look what happens when I add an OR 1=0
:
EXPLAIN ANALYZE
select *
from `sessions`
where `sessions`.`session_id` IN (
select `presenters`.`session_id`
from `presenters`
where `presenters`.`user_id` = 71
)
OR 1=0; -- OR false also causes the execution plan change
/*
-> Filter: <in_optimizer>(sessions.session_id,sessions.session_id in (select #2)) (cost=3.27e+6 rows=31.9e+6) (actual time=0.0549..78796 rows=1 loops=1)
-> Table scan on sessions (cost=3.27e+6 rows=31.9e+6) (actual time=0.0252..52601 rows=34e+6 loops=1)
-> Select #2 (subquery in condition; run only once)
-> Filter: ((sessions.session_id = `<materialized_subquery>`.session_id)) (cost=1.3..1.3 rows=1) (actual time=522e-6..522e-6 rows=29.4e-9 loops=34e+6)
-> Limit: 1 row(s) (cost=1.2..1.2 rows=1) (actual time=419e-6..419e-6 rows=29.4e-9 loops=34e+6)
-> Index lookup on <materialized_subquery> using <auto_distinct_key> (session_id=sessions.session_id) (actual time=308e-6..308e-6 rows=29.4e-9 loops=34e+6)
-> Materialize with deduplication (cost=1.2..1.2 rows=1) (actual time=0.0224..0.0224 rows=1 loops=1)
-> Index lookup on presenters using presenters_user_id_foreign (user_id=71) (cost=1.1 rows=1) (actual time=0.0161..0.0177 rows=1 loops=1)
*/
Note that this does a full table scan. The first query ran in less than a millisecond, this one took a little over a minute. At this point we have to ban any use of OR
combined with WHERE IN
.
Is there any reason why this is happening? Is there anything we can do to work around this aside from making two separate queries and unioning the results?
We are using MySql 8.0.36. After running into a performance issue where a developer added an OR WHERE IN (subquery)
statement to one of our queries, I started to investigate what combination of statements triggers bad performance. I was surprised to learn that adding OR
to a query that has WHERE IN
significantly decreases performance, even if it is constant time.
Take for example this query. There's a session table (39 million rows), and a presenters table (32 million rows) that has a non-unique foreign key to sessions:
EXPLAIN ANALYZE
select *
from `sessions`
where `sessions`.`session_id` IN (
select `presenters`.`session_id`
from `presenters`
where `presenters`.`user_id` = 71
);
/*
-> Nested loop inner join (cost=1.45 rows=1) (actual time=0.0489..0.051 rows=1 loops=1)
-> Index lookup on presenters using presenters_user_id_foreign (user_id=71) (cost=1.1 rows=1) (actual time=0.0301..0.032 rows=1 loops=1)
-> Single-row index lookup on sessions using PRIMARY (session_id=presenters.session_id) (cost=0.35 rows=1) (actual time=0.0171..0.0171 rows=1 loops=1)
*/
Now take a look what happens when I add an OR 1=0
:
EXPLAIN ANALYZE
select *
from `sessions`
where `sessions`.`session_id` IN (
select `presenters`.`session_id`
from `presenters`
where `presenters`.`user_id` = 71
)
OR 1=0; -- OR false also causes the execution plan change
/*
-> Filter: <in_optimizer>(sessions.session_id,sessions.session_id in (select #2)) (cost=3.27e+6 rows=31.9e+6) (actual time=0.0549..78796 rows=1 loops=1)
-> Table scan on sessions (cost=3.27e+6 rows=31.9e+6) (actual time=0.0252..52601 rows=34e+6 loops=1)
-> Select #2 (subquery in condition; run only once)
-> Filter: ((sessions.session_id = `<materialized_subquery>`.session_id)) (cost=1.3..1.3 rows=1) (actual time=522e-6..522e-6 rows=29.4e-9 loops=34e+6)
-> Limit: 1 row(s) (cost=1.2..1.2 rows=1) (actual time=419e-6..419e-6 rows=29.4e-9 loops=34e+6)
-> Index lookup on <materialized_subquery> using <auto_distinct_key> (session_id=sessions.session_id) (actual time=308e-6..308e-6 rows=29.4e-9 loops=34e+6)
-> Materialize with deduplication (cost=1.2..1.2 rows=1) (actual time=0.0224..0.0224 rows=1 loops=1)
-> Index lookup on presenters using presenters_user_id_foreign (user_id=71) (cost=1.1 rows=1) (actual time=0.0161..0.0177 rows=1 loops=1)
*/
Note that this does a full table scan. The first query ran in less than a millisecond, this one took a little over a minute. At this point we have to ban any use of OR
combined with WHERE IN
.
Is there any reason why this is happening? Is there anything we can do to work around this aside from making two separate queries and unioning the results?
Share Improve this question edited Nov 18, 2024 at 17:17 jgawrych asked Nov 17, 2024 at 5:30 jgawrychjgawrych 3,5411 gold badge31 silver badges41 bronze badges 7 | Show 2 more comments1 Answer
Reset to default 2When building a
WHERE
clause, avoid unnecessary things. (I, too, am surprised that the Optimizer failed to see that asOR FALSE
and remove it.)Write the query as a simple
JOIN
; the Optimizer is better at optimizingJOIN
thanIN ( SELECT ... )
:SELECT s.* FROM presenters AS p JOIN sessions AS s USING (session_id) WHERE p.user_id = 71
Have
INDEX(session_id)
on `sessions.File a bug with bugs.mysql if you feel strongly enough about this issue.
WHERE IN (...)
renders the index not very usable or beneficial. – Tim Biegeleisen Commented Nov 17, 2024 at 5:32EXISTS
is a "semijoin"; it works somewhat like aJOIN
, but can stop when one row matches.IN
is often optimized more poorly than the equivalent. – Rick James Commented Nov 17, 2024 at 16:11OR 1=0
into the subquery did not cause the optimizer to change plans. Interestingly, I then tried converting the query into anEXISTS
and it didn't change either. However, adding back my real condition, caused a completely different plan, but it's worse than ever, joining two full table index scans – jgawrych Commented Nov 18, 2024 at 17:32