I have a query that is extensively used for keyword searches. Due to the large volume of data in the table, the query takes a significant amount of time to execute. The query also includes ORDER BY and LIMIT, which further slow it down. The account table contains more than 15 million records.
Let me provide an example using one client. For client_id = 123, there are 5 million records. Out of these, 3.5 million records have the disabled column set to 'Y', and the remaining 1.5 million records have the disabled column set to 'N'.
Most of the time, the search is for active accounts, so the query looks for disabled = 'N'. Occasionally, the search includes both active and disabled accounts, so the query checks for disabled = 'Y' and disabled = 'N'.
For this client, the search query is very slow. Although there is an index on the client_id column, it doesn't provide much help. The disabled column is slowing down the query, followed by the keyword search columns and the ORDER BY clause.
The cardinality of client_id is low, with only around 1,500 unique values, and the disabled column has only two possible values: 'Y' and 'N'. So, I am struggling to optimize this query further.
SELECT *
FROM account
WHERE client_id = ?
AND disabled IN (?)
AND (username LIKE '%?%' OR name LIKE '%?%' OR surname LIKE '%?%')
ORDER BY name, surname
LIMIT ?;
Table definition:
CREATE TABLE account (
    account_id INT NOT NULL AUTO_INCREMENT,
    client_id BIGINT NOT NULL DEFAULT '0',
    username VARCHAR(255) NOT NULL DEFAULT '',
    name VARCHAR(255),
    surname VARCHAR(255),
    disabled CHAR(1) NOT NULL DEFAULT 'N',
    PRIMARY KEY (account_id),
    UNIQUE KEY username (username, client_id),
    KEY siteid (client_id)
) ENGINE=InnoDB;
For one client_id, the EXPLAIN output is as below.
id,select_type,table,partitions,type,possible_keys,key,key_len,ref,rows,filtered,Extra
1,SIMPLE,acc,,range,"PRIMARY,client_id",client_id,12,,743889,0.2,Using index condition; Using where; Using filesort
asked Mar 15 at 10:54 by Ashik K; edited Mar 15 at 13:32 by O. Jones
- You say for a client ID you can get millions of rows, and I see you put a LIMIT clause at the end of the query. I assume the LIMIT clause is supposed to reduce the result to a reasonable amount of rows, so after all you may often end up with just a sample of the resulting rows. If that is the case, you may want to apply LIMIT first and only then apply ORDER BY. This will still give you n sample matching rows, but the DBMS won't have to sort millions of rows to achieve this. Just an idea. You may want to give us more information about what the query is for, anyway. – Thorsten Kettner Commented Mar 17 at 13:24
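The comment's limit-then-sort idea could be sketched like this. Assumption to note: without an ORDER BY inside the derived table, the n rows returned are an arbitrary sample of the matches, which is only acceptable if any n matching rows will do.

```sql
-- Sketch of "LIMIT first, then ORDER BY": the server can stop scanning after
-- the first n matches, and the outer query sorts only those n rows.
-- Which n rows are picked is nondeterministic without an inner ORDER BY.
SELECT t.*
FROM (
    SELECT *
    FROM account
    WHERE client_id = ?
      AND disabled = 'N'
      AND (username LIKE '%?%' OR name LIKE '%?%' OR surname LIKE '%?%')
    LIMIT ?                       -- stop after the first n matches
) AS t
ORDER BY t.name, t.surname;       -- sort only the n sampled rows
```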
3 Answers
I have a two-stage approach to improve performance:
1- You said the cardinality of client_id is low, so searching for one id can return more than 5 percent of all records. That means an index on client_id alone is not selective enough, and I recommend dropping it.
2- You could use list partitioning to split the table on the disabled column. That would let a search touch only the 'N' rows or only the 'Y' rows. It is also worth evaluating partitioning on client_id, given its low cardinality; my guess is that partitioning on client_id would help more than partitioning on disabled.
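A sketch of the list-partitioning suggestion, with one important caveat: MySQL requires every unique key (including the primary key) to contain the partitioning column, so the keys below are widened to include disabled. This is a real schema change, not a drop-in ALTER, and the partitioned table name here is illustrative.

```sql
-- Sketch: LIST COLUMNS partitioning on `disabled`, so a search for active
-- accounts touches only the p_active partition.
CREATE TABLE account_partitioned (
    account_id INT NOT NULL AUTO_INCREMENT,
    client_id BIGINT NOT NULL DEFAULT '0',
    username VARCHAR(255) NOT NULL DEFAULT '',
    name VARCHAR(255),
    surname VARCHAR(255),
    disabled CHAR(1) NOT NULL DEFAULT 'N',
    -- unique keys must include the partitioning column:
    PRIMARY KEY (account_id, disabled),
    UNIQUE KEY username (username, client_id, disabled),
    KEY siteid (client_id)
) ENGINE=InnoDB
PARTITION BY LIST COLUMNS (disabled) (
    PARTITION p_active VALUES IN ('N'),
    PARTITION p_disabled VALUES IN ('Y')
);
```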
A few observations:
WHERE ... column LIKE '%searchString%'
is the most notorious of query-performance antipatterns (except in PostgreSQL, which offers so-called trigram indexes). It has to examine every character of every row and can't exploit indexes. If you can change to using
WHERE ... column LIKE 'searchString%'
without the leading % wildcard character, your searches will get a lot faster.
SELECT * FROM many rows ORDER BY whatever LIMIT small_number
is also a notorious query-performance antipattern. It has to build a big result set, sort it, then discard most of it.
WHERE ... term OR term OR term
is also a query-performance antipattern.
If you have a column like your disabled that you know contains only two values, you are much better off completely omitting the WHERE ... disabled IN ('Y','N') search term than including it.
If you choose the character set and collation for your text columns wisely, you'll save time and space. For example, if all your username values are plain old 8-bit ASCII and not Unicode, declare that column with COLLATE latin1_general_ci.
You may wish to investigate using FULLTEXT search for this application if you can't switch to PostgreSQL. That overcomes the first antipattern I mentioned.
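A sketch of the FULLTEXT suggestion, assuming MySQL 5.6+ (InnoDB FULLTEXT support); the index name is illustrative. Note that FULLTEXT matches words and word prefixes, not arbitrary infixes, so its results differ from a '%...%' LIKE.

```sql
-- Sketch: a FULLTEXT index over the three searched columns, then a
-- boolean-mode prefix search instead of the leading-wildcard LIKEs.
ALTER TABLE account
    ADD FULLTEXT INDEX ft_account_search (username, name, surname);

SELECT *
FROM account
WHERE client_id = ?
  AND disabled = 'N'
  AND MATCH (username, name, surname) AGAINST ('searchString*' IN BOOLEAN MODE)
ORDER BY name, surname
LIMIT ?;
```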
You can also try an approach like this, in which you first retrieve the account_id values you need, then join back to the whole table.
SELECT whole.*
FROM account whole
JOIN (
    SELECT account_id
    FROM (
        SELECT account_id, name, surname
        FROM account
        WHERE client_id = ?
          AND disabled = 'N'
          AND username LIKE '%?%'
        UNION
        SELECT account_id, name, surname
        FROM account
        WHERE client_id = ?
          AND disabled = 'N'
          AND name LIKE '%?%'
        UNION
        SELECT account_id, name, surname
        FROM account
        WHERE client_id = ?
          AND disabled = 'N'
          AND surname LIKE '%?%'
    ) details
    ORDER BY name, surname
    LIMIT ?
) ids ON whole.account_id = ids.account_id
ORDER BY whole.name, whole.surname;
These three indexes will probably help somewhat:
CREATE INDEX cli_dis_user ON account(client_id, disabled, username);
CREATE INDEX cli_dis_name ON account(client_id, disabled, name);
CREATE INDEX cli_dis_surn ON account(client_id, disabled, surname);
That's because one of them will accelerate, a bit, each of the nested UNIONed queries by allowing it to be satisfied by an index range scan.
While https://stackoverflow/users/205608/o-jones has a good answer, there is scope for further optimisation.
Your query:
SELECT *
FROM account
WHERE client_id = ?
AND disabled IN (?)
AND (username LIKE '%?%' OR name LIKE '%?%' OR surname LIKE '%?%')
ORDER BY name, surname
LIMIT ?;
Your constraint: a single client_id can match a very high number of rows.
How to optimize this query?
Observation 1:
No index can binary-search a predicate like username LIKE '%?%' or the other similar LIKE clauses. The best an index can do is a full index scan (which is different from the full table scan your query is probably doing now, and is orders of magnitude faster because the index is much smaller than the table).
Observation 2:
The clauses on username, name and surname are joined with OR. That implies you need a composite index containing all three columns.
Observation 3:
The other clauses are on client_id and disabled. The selectivity of either is not high, so an individual index on either column is of no use. A composite index containing these columns plus the three above does make sense, though it is optional (index selection will only consider the index if these columns are present). So far we are at: client_id, disabled, username, name, surname [I1]
Observation 4:
You want to optimize the ORDER BY clause as well. The only way to do that is to keep the data physically sorted in the order of your clause; remember that an index is always kept physically sorted. So let's change
client_id, disabled, username, name, surname
to:
name, surname, username, client_id, disabled
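The reordered index described above could be created like this; the index name is illustrative. Note that with leading-wildcard LIKE predicates the server can at best walk this index in sorted order and filter, and since the query is SELECT *, each match still costs a row lookup.

```sql
-- Sketch: composite index leading with the ORDER BY columns, so rows can be
-- read in (name, surname) order and the sort/filesort avoided.
CREATE INDEX idx_account_search
    ON account (name, surname, username, client_id, disabled);
```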
Long-term items:
Consider partitioning your data.
The above index will speed up your query but has negative consequences: it is wide and costly to maintain on writes. So you might want to drop the last two columns and use an index hint.
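The "shorter index plus a hint" idea could look like this; the index name is illustrative. FORCE INDEX tells the optimizer to use the named index even when its cost estimates would choose otherwise.

```sql
-- Sketch: a narrower index (ORDER BY columns first), forced via an index hint
-- so the server reads rows in sorted order and stops once LIMIT is satisfied.
CREATE INDEX idx_account_order ON account (name, surname, username);

SELECT *
FROM account FORCE INDEX (idx_account_order)
WHERE client_id = ?
  AND disabled IN (?)
  AND (username LIKE '%?%' OR name LIKE '%?%' OR surname LIKE '%?%')
ORDER BY name, surname
LIMIT ?;
```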