I have a query that is extensively used for keyword searches. Due to the large volume of data in the table, the query takes a significant amount of time to execute. The query also includes ORDER BY and LIMIT, which further slow it down. The account table contains more than 15 million records.
Let me provide an example using one client. For client_id = 123, there are 5 million records. Out of these, 3.5 million records have the disabled column set to 'Y', and the remaining 1.5 million records have the disabled column set to 'N'.
Most of the time, the search is for active accounts, so the query looks for disabled = 'N'. Occasionally, the search includes both active and disabled accounts, so the query checks for disabled = 'Y' and disabled = 'N'.
For this client, the search query is very slow. Although there is an index on the client_id column, it doesn't provide much help. The disabled column is slowing down the query, followed by the keyword search columns and the ORDER BY clause.
The cardinality of client_id is low, with only around 1,500 unique values, and the disabled column has only two possible values: 'Y' and 'N'. So, I am struggling to optimize this query further.
SELECT *
FROM account
WHERE client_id = ?
AND disabled IN (?)
AND (username LIKE '%?%' OR name LIKE '%?%' OR surname LIKE '%?%')
ORDER BY name, surname
LIMIT ?;
Table definition:
CREATE TABLE account (
    account_id INT NOT NULL AUTO_INCREMENT,
    client_id BIGINT NOT NULL DEFAULT '0',
    username VARCHAR(255) NOT NULL DEFAULT '',
    name VARCHAR(255),
    surname VARCHAR(255),
    disabled CHAR(1) NOT NULL DEFAULT 'N',
    PRIMARY KEY (account_id),
    UNIQUE KEY username (username, client_id),
    KEY siteid (client_id)
) ENGINE=InnoDB;
For one client_id, the EXPLAIN output is as below.
id,select_type,table,partitions,type,possible_keys,key,key_len,ref,rows,filtered,Extra
1,SIMPLE,acc,,range,"PRIMARY,client_id",client_id,12,,743889,0.2,Using index condition; Using where; Using filesort
asked Mar 15 at 10:54 by Ashik K; edited Mar 15 at 13:32 by O. Jones
- You say for a client ID you can get millions of rows, and I see you put a LIMIT clause at the end of the query. I assume the LIMIT clause is supposed to reduce the result to a reasonable amount of rows, so after all you may often end up with just a sample of the resulting rows. If that is the case, you may want to apply LIMIT first and only then apply ORDER BY. This will still give you n sample matching rows, but the DBMS won't have to sort millions of rows to achieve this. Just an idea. You may want to give us more information about what the query is for, anyway. – Thorsten Kettner Commented Mar 17 at 13:24
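The comment's limit-then-sort idea could be sketched like this. Assumption to note: without an ORDER BY inside the derived table, the n rows returned are an arbitrary sample of the matches, which is only acceptable if any n matching rows will do.

```sql
-- Sketch of "LIMIT first, then ORDER BY": the server can stop scanning after
-- the first n matches, and the outer query sorts only those n rows.
-- Which n rows are picked is nondeterministic without an inner ORDER BY.
SELECT t.*
FROM (
    SELECT *
    FROM account
    WHERE client_id = ?
      AND disabled = 'N'
      AND (username LIKE '%?%' OR name LIKE '%?%' OR surname LIKE '%?%')
    LIMIT ?                       -- stop after the first n matches
) AS t
ORDER BY t.name, t.surname;       -- sort only the n sampled rows
```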
3 Answers
I have a two-stage approach to improve performance:
1- You said the cardinality of client_id is low, so searching for one id can return more than 5 percent of all records. That means an index on client_id alone is not selective enough, and I recommend dropping it.
2- You could use list partitioning to split the table on the disabled column. That would let a search touch only the 'N' rows or only the 'Y' rows. It is also worth evaluating partitioning on client_id, given its low cardinality; my guess is that partitioning on client_id would help more than partitioning on disabled.
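A sketch of the list-partitioning suggestion, with one important caveat: MySQL requires every unique key (including the primary key) to contain the partitioning column, so the keys below are widened to include disabled. This is a real schema change, not a drop-in ALTER, and the partitioned table name here is illustrative.

```sql
-- Sketch: LIST COLUMNS partitioning on `disabled`, so a search for active
-- accounts touches only the p_active partition.
CREATE TABLE account_partitioned (
    account_id INT NOT NULL AUTO_INCREMENT,
    client_id BIGINT NOT NULL DEFAULT '0',
    username VARCHAR(255) NOT NULL DEFAULT '',
    name VARCHAR(255),
    surname VARCHAR(255),
    disabled CHAR(1) NOT NULL DEFAULT 'N',
    -- unique keys must include the partitioning column:
    PRIMARY KEY (account_id, disabled),
    UNIQUE KEY username (username, client_id, disabled),
    KEY siteid (client_id)
) ENGINE=InnoDB
PARTITION BY LIST COLUMNS (disabled) (
    PARTITION p_active VALUES IN ('N'),
    PARTITION p_disabled VALUES IN ('Y')
);
```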
A few observations:
WHERE ... column LIKE '%searchString%'
is the most notorious of query-performance antipatterns (except in PostgreSQL, which offers so-called trigram indexes). It has to examine every character of every row and can't exploit indexes. If you can change to using
WHERE ... column LIKE 'searchString%'
without the leading % wildcard character, your searches will get a lot faster.
SELECT * FROM many rows ORDER BY whatever LIMIT small_number
is also a notorious query-performance antipattern. It has to build a big result set, sort it, then discard most of it.
WHERE ... term OR term OR term
is also a query-performance antipattern.
If you have a column like your disabled that you know contains only two values, you are much better off completely omitting the WHERE ... disabled IN ('Y','N') search term than including it.
If you choose the character set and collation for your text columns wisely, you'll save time and space. For example, if all your username values are plain old 8-bit ASCII and not Unicode, declare that column with COLLATE latin1_general_ci.
You may wish to investigate using FULLTEXT search for this application if you can't switch to PostgreSQL. That overcomes the first antipattern I mentioned.
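A sketch of the FULLTEXT suggestion, assuming MySQL 5.6+ (InnoDB FULLTEXT support); the index name is illustrative. Note that FULLTEXT matches words and word prefixes, not arbitrary infixes, so its results differ from a '%...%' LIKE.

```sql
-- Sketch: a FULLTEXT index over the three searched columns, then a
-- boolean-mode prefix search instead of the leading-wildcard LIKEs.
ALTER TABLE account
    ADD FULLTEXT INDEX ft_account_search (username, name, surname);

SELECT *
FROM account
WHERE client_id = ?
  AND disabled = 'N'
  AND MATCH (username, name, surname) AGAINST ('searchString*' IN BOOLEAN MODE)
ORDER BY name, surname
LIMIT ?;
```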
You can also try an approach like this, in which you first retrieve the account_id values you need, then join back to the whole table.
SELECT whole.*
FROM account whole
JOIN (
    SELECT account_id
    FROM (
        SELECT account_id, name, surname
        FROM account
        WHERE client_id = ?
          AND disabled = 'N'
          AND username LIKE '%?%'
        UNION
        SELECT account_id, name, surname
        FROM account
        WHERE client_id = ?
          AND disabled = 'N'
          AND name LIKE '%?%'
        UNION
        SELECT account_id, name, surname
        FROM account
        WHERE client_id = ?
          AND disabled = 'N'
          AND surname LIKE '%?%'
    ) details
    ORDER BY name, surname
    LIMIT ?
) ids ON whole.account_id = ids.account_id
ORDER BY whole.name, whole.surname;
These three indexes will probably help somewhat:
CREATE INDEX cli_dis_user ON account(client_id, disabled, username);
CREATE INDEX cli_dis_name ON account(client_id, disabled, name);
CREATE INDEX cli_dis_surn ON account(client_id, disabled, surname);
That's because one of them will accelerate, a bit, each of the nested UNIONed queries by allowing it to be satisfied by an index range scan.
While https://stackoverflow/users/205608/o-jones has a good answer, there is scope for further optimisation.
Your query:
SELECT *
FROM account
WHERE client_id = ?
AND disabled IN (?)
AND (username LIKE '%?%' OR name LIKE '%?%' OR surname LIKE '%?%')
ORDER BY name, surname
LIMIT ?;
Your constraint: a single client_id can match a very high number of rows.
How to optimize this query?
Observation 1:
No index can binary-search a predicate like username LIKE '%?%' or the other similar LIKE clauses. The best an index can do is a full index scan (which is different from the full table scan your query is probably doing now, and is orders of magnitude faster because the index is much smaller than the table).
Observation 2:
The clauses on username, name and surname are joined with OR. That implies you need a composite index containing all three columns.
Observation 3:
The other clauses are on client_id and disabled. The selectivity of either is not high, so an individual index on either column is of no use. A composite index containing these columns plus the three above does make sense, though it is optional (index selection will only consider the index if these columns are present). So far we are at: client_id, disabled, username, name, surname [I1]
Observation 4:
You want to optimize the ORDER BY clause as well. The only way to do that is to keep the data physically sorted in the order of your clause; remember that an index is always kept physically sorted. So let's change
client_id, disabled, username, name, surname
to:
name, surname, username, client_id, disabled
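The reordered index described above could be created like this; the index name is illustrative. Note that with leading-wildcard LIKE predicates the server can at best walk this index in sorted order and filter, and since the query is SELECT *, each match still costs a row lookup.

```sql
-- Sketch: composite index leading with the ORDER BY columns, so rows can be
-- read in (name, surname) order and the sort/filesort avoided.
CREATE INDEX idx_account_search
    ON account (name, surname, username, client_id, disabled);
```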
Long-term items:
Consider partitioning your data.
The above index will speed up your query but has negative consequences: it is wide and costly to maintain on writes. So you might want to drop the last two columns and use an index hint.
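The "shorter index plus a hint" idea could look like this; the index name is illustrative. FORCE INDEX tells the optimizer to use the named index even when its cost estimates would choose otherwise.

```sql
-- Sketch: a narrower index (ORDER BY columns first), forced via an index hint
-- so the server reads rows in sorted order and stops once LIMIT is satisfied.
CREATE INDEX idx_account_order ON account (name, surname, username);

SELECT *
FROM account FORCE INDEX (idx_account_order)
WHERE client_id = ?
  AND disabled IN (?)
  AND (username LIKE '%?%' OR name LIKE '%?%' OR surname LIKE '%?%')
ORDER BY name, surname
LIMIT ?;
```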