Assuming these two (simplified) tables:
CREATE TABLE books (
book_id VARCHAR(25) NOT NULL,
subgenre_1 VARCHAR(6) NOT NULL,
subgenre_2 VARCHAR(6),
subgenre_3 VARCHAR(6),
mood_1 VARCHAR(4) NOT NULL,
mood_2 VARCHAR(4),
mood_3 VARCHAR(4),
PRIMARY KEY (book_id)
);
CREATE TABLE collections (
collection_id VARCHAR(25) NOT NULL,
subgenre_1 VARCHAR(6) NOT NULL,
subgenre_2 VARCHAR(6),
subgenre_3 VARCHAR(6),
mood_1 VARCHAR(4) NOT NULL,
mood_2 VARCHAR(4),
mood_3 VARCHAR(4),
PRIMARY KEY (collection_id)
);
I'm trying to display the collections that match a specific book.
$bk_sg1 = "FICT";
$bk_sg2 = "HIST";
$bk_sg3 = "JUVE";
$bk_md1 = "ROMA";
$bk_md2 = "TENS";
$bk_md3 = "EMOT";
SELECT collection_id FROM collections
WHERE (subgenre_1 = $bk_sg1 OR subgenre_2 = $bk_sg1 OR subgenre_3 = $bk_sg1) OR
(subgenre_1 = $bk_sg2 OR subgenre_2 = $bk_sg2 OR subgenre_3 = $bk_sg2) OR
(subgenre_1 = $bk_sg3 OR subgenre_2 = $bk_sg3 OR subgenre_3 = $bk_sg3) OR
(mood_1 = $bk_md1 OR mood_2 = $bk_md1 OR mood_3 = $bk_md1) OR
(mood_1 = $bk_md2 OR mood_2 = $bk_md2 OR mood_3 = $bk_md2) OR
(mood_1 = $bk_md3 OR mood_2 = $bk_md3 OR mood_3 = $bk_md3)
Is there a more efficient way to do the SELECT statement? The collections table will be very large eventually and I don't want to worry about efficiency in the future if I can avoid it.
Assuming these two (simplified) tables:
CREATE TABLE books (
book_id VARCHAR(25) NOT NULL,
subgenre_1 VARCHAR(6) NOT NULL,
subgenre_2 VARCHAR(6),
subgenre_3 VARCHAR(6),
mood_1 VARCHAR(4) NOT NULL,
mood_2 VARCHAR(4),
mood_3 VARCHAR(4),
PRIMARY KEY (book_id)
);
CREATE TABLE collections (
collection_id VARCHAR(25) NOT NULL,
subgenre_1 VARCHAR(6) NOT NULL,
subgenre_2 VARCHAR(6),
subgenre_3 VARCHAR(6),
mood_1 VARCHAR(4) NOT NULL,
mood_2 VARCHAR(4),
mood_3 VARCHAR(4),
PRIMARY KEY (collection_id)
);
I'm trying to display the collections that match a specific book.
$bk_sg1 = "FICT";
$bk_sg2 = "HIST";
$bk_sg3 = "JUVE";
$bk_md1 = "ROMA";
$bk_md2 = "TENS";
$bk_md3 = "EMOT";
SELECT collection_id FROM collections
WHERE (subgenre_1 = $bk_sg1 OR subgenre_2 = $bk_sg1 OR subgenre_3 = $bk_sg1) OR
(subgenre_1 = $bk_sg2 OR subgenre_2 = $bk_sg2 OR subgenre_3 = $bk_sg2) OR
(subgenre_1 = $bk_sg3 OR subgenre_2 = $bk_sg3 OR subgenre_3 = $bk_sg3) OR
(mood_1 = $bk_md1 OR mood_2 = $bk_md1 OR mood_3 = $bk_md1) OR
(mood_1 = $bk_md2 OR mood_2 = $bk_md2 OR mood_3 = $bk_md2) OR
(mood_1 = $bk_md3 OR mood_2 = $bk_md3 OR mood_3 = $bk_md3)
Is there a more efficient way to do the SELECT statement? The collections table will be very large eventually and I don't want to worry about efficiency in the future if I can avoid it.
Share Improve this question edited Mar 18 at 20:04 Shadow 34.3k10 gold badges65 silver badges75 bronze badges asked Mar 18 at 19:04 Antonio RomoAntonio Romo 776 bronze badges 12 | Show 7 more comments2 Answers
Reset to default 3To improve readability and maintainability, you can use IN() instead of multiple OR conditions:
SELECT collection_id
FROM collections
WHERE subgenre_1 IN ('$bk_sg1', '$bk_sg2', '$bk_sg3')
OR subgenre_2 IN ('$bk_sg1', '$bk_sg2', '$bk_sg3')
OR subgenre_3 IN ('$bk_sg1', '$bk_sg2', '$bk_sg3')
OR mood_1 IN ('$bk_md1', '$bk_md2', '$bk_md3')
OR mood_2 IN ('$bk_md1', '$bk_md2', '$bk_md3')
OR mood_3 IN ('$bk_md1', '$bk_md2', '$bk_md3');
Add indexes on subgenre_1, subgenre_2, subgenre_3, mood_1, mood_2, mood_3
for faster searching
CREATE INDEX idx_subgenres ON collections (subgenre_1, subgenre_2, subgenre_3);
CREATE INDEX idx_moods ON collections (mood_1, mood_2, mood_3);
NOTE: This answer focus on the index optimization on you current table collections
. You can perform normalization by separating tables as others suggested. But for now, let's use your original table design .
Be careful when you have a composite index which involves multiple columns and your WHERE clause uses OR logical operator for the indexed columns. That composite index may not be enough to cover all search. That's to say, the following query will have subpar performance regarding the use of index:
create index idx on testtb(col1,col2);
select * from testtb
WHERE col1=$val1 or col2=$val2;
The above query has two issues:
First of all, an index (in our case, a B-TREE plus index) works from the leftmost prefix. So in the above query , col2=$val2
does not benefit from the index testtb(col1,col2)
,as the logical operator OR
makes col2=$val2
logically independent from col1=$val1
. However , if you use WHERE col1=$val1 AND col2=$val2
, our index is suffice. But since we use OR in our where clause, we will need two indexes.(one for each column which starts as the first column defined in respective index) Therefore, an addtional index should be created for the cause:
create index idx2 on testtb(col2);
Secondly, only one index can be used for a table at one time. That's to say, WHERE col1=$val or col2=$val2
can only use one index, being it idx or idx2, but not both. To fully use our indexes, we can make our one search on the table into two individual searches. To do this , we shall use UNION.
select * from testtb
WHERE col1=$val1
UNION
select * from testtb
WHERE col2=$val2;
That way, idx is used for the initial search while idx2 is used for the one below UNION.
Let's take an example in my test environment. I have a table named proctable
which has over a million rows:
select count(*) from proctable;
+----------+
| count(*) |
+----------+
| 1429158 |
+----------+
It has two indexes:
show index in proctable;
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| proctable | 0 | PRIMARY | 1 | id | A | 0 | NULL | NULL | | BTREE | | | YES | NULL |
| proctable | 1 | idx | 1 | id | A | 0 | NULL | NULL | | BTREE | | | YES | NULL |
| proctable | 1 | idx | 2 | num | A | 0 | NULL | NULL | YES | BTREE | | | YES | NULL |
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
However, none of them has column num
defined as the leftmost column. Now let's try our first query.
select * from proctable where id=999999 or num=99999;
+--------+------+------+------+
| id | num | jid | uid |
+--------+------+------+------+
| 999999 | 2030 | 29 | NULL |
+--------+------+------+------+
1 row in set (1.23 sec)
It took 1.23 second. Now let's improve the search by adding another index and modify our query a little bit.
create index idx2 on proctable(num);
Query OK, 0 rows affected (5.39 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> select * from proctable where id=999999
-> union
-> select * from proctable where num=99999;
+--------+------+------+------+
| id | num | jid | uid |
+--------+------+------+------+
| 999999 | 2030 | 29 | NULL |
+--------+------+------+------+
1 row in set (0.00 sec)
That's a huge improvement due to the precise index search. Here is the execution plan in case you are interested:
explain select * from proctable where id=999999 or num=99999;
+----+-------------+-----------+------------+------+------------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+------------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | proctable | NULL | ALL | PRIMARY,idx,idx2 | NULL | NULL | NULL | 1 | 100.00 | Using where |
+----+-------------+-----------+------------+------+------------------+------+---------+------+------+----------+-------------+
explain select * from proctable where id=999999 union select * from proctable where num=99999;
+----+--------------+------------+------------+-------+---------------+---------+---------+-------+------+----------+-----------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+------------+------------+-------+---------------+---------+---------+-------+------+----------+-----------------+
| 1 | PRIMARY | proctable | NULL | const | PRIMARY,idx | PRIMARY | 4 | const | 1 | 100.00 | NULL |
| 2 | UNION | proctable | NULL | ref | idx2 | idx2 | 5 | const | 1 | 100.00 | NULL |
| 3 | UNION RESULT | <union1,2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | NULL | Using temporary |
+----+--------------+------------+------------+-------+---------------+---------+---------+-------+------+----------+-----------------+
As denoted above , our suboptimal query results a table scan due to only one index can be used.(the one which covers column id
or the one which covers column num
, but NOT BOTH) Nonetheless, our second query involving the use of UNION successfully turn the query into two separate searches which then profit on the index best suited for each. This is verified by the access type in the execution plan. (const
or ref
)
Finally, with everything demonstrated above, should we aim to achieve maximum index performance in your case, consider the following changes:
create index idx_subgenre_1 on collections(subgenre_1); -- this one is not not necessary if you already have a composite index which starts with column subgenre_1
create index idx_subgenre_2 on collections(subgenre_2);
create index idx_subgenre_3 on collections(subgenre_3);
create index idx_mood_1 on collections(mood_1); -- this one is not not necessary if you already have a composite index which starts with column mood_1
create index idx_mood_2 on collections(mood_2);
create index idx_mood_3 on collections(mood_3);
select collection_id from collections where subgenre_1 in ('$bk_sg1', '$bk_sg2', '$bk_sg3')
union
select collection_id from collections where subgenre_2 in ('$bk_sg1', '$bk_sg2', '$bk_sg3')
union
select collection_id from collections where subgenre_3 in ('$bk_sg1', '$bk_sg2', '$bk_sg3')
union
select collection_id from collections where mood_1 in ('$bk_md1', '$bk_md2', '$bk_md3')
union
select collection_id from collections where mood_2 in ('$bk_md1', '$bk_md2', '$bk_md3')
union
select collection_id from collections where mood_3 in ('$bk_md1', '$bk_md2', '$bk_md3');
genres
andmoods
tables and keep the "limit to 3" rule in your application business logic. Otherwise, I think anIN
might help this query, if I'm reading it correctly. – Chris Haas Commented Mar 18 at 19:27SET
would work exactly the same as what the OP has now — a table-scan. It would make the query take fewer keystrokes, but it would not have better performance. – Bill Karwin Commented Mar 19 at 0:47'$bk_sg1' IN (subgenre_1, subgenre_2, subgenre_3)
instead of theOR
condition. But as @ChrisHaas says, it would be better to normalize your tables so each subgenre is in a separate row, with a foreign key to thecollections
table. – Barmar Commented Mar 19 at 2:56