The MySQL Query Optimizer | High Performance MySQL: Optimization, Backups, Replication, and More

When you issue a query that selects rows, MySQL analyzes it to see if any optimizations can be used to process the query more quickly. In this section, we'll look at how the query optimizer works. For additional information, consult the optimization chapter in the MySQL Reference Manual; it describes various optimization measures that MySQL takes.

The MySQL query optimizer takes advantage of indexes, of course, but it also uses other information. For example, if you issue the following query, MySQL will execute it very quickly, no matter how large the table is:

 SELECT * FROM tbl_name WHERE 1 = 0;

In this case, MySQL looks at the WHERE clause, realizes that no rows can possibly satisfy the query, and doesn't even bother to search the table. You can see this by issuing an EXPLAIN statement, which tells MySQL to display some information about how it would execute a SELECT query without actually executing it. To use EXPLAIN, just put the word EXPLAIN in front of the SELECT statement:

 mysql> EXPLAIN SELECT * FROM tbl_name WHERE 1 = 0;  +------------------+ | Comment          | +------------------+ | Impossible WHERE | +------------------+

Normally, EXPLAIN returns more information than that, including information about the indexes that will be used to scan tables, the types of joins that will be used, and estimates of the number of rows that will need to be scanned from each table.

How the Optimizer Works

The MySQL query optimizer has several goals, but its primary aims are to use indexes whenever possible and to use the most restrictive index to eliminate as many rows as possible as soon as possible. That last part may sound backward because it's non-intuitive. After all, because your goal in issuing a SELECT statement is to find rows, not to reject them. The reason the optimizer works this way is that the faster it can eliminate rows from consideration, the more quickly the rows that do match your criteria can be found. Queries can be processed more quickly if the most restrictive tests can be done first. Suppose you have a query that tests two columns, each of which has an index on it:

 SELECT col3 FROM mytable  WHERE col1 = 'some value' AND col2 = 'some other value';

Suppose also that the test on col1 matches 900 rows, the test on col2 matches 300 rows, and that both tests succeed on 30 rows. Testing col1 first results in 900 rows that must be examined to find the 30 that also match the col2 value. That's 870 failed tests. Testing col2 first results in 300 rows that must be examined to find the 30 that also match the col1 value. That's only 270 failed tests, so less computation and disk I/O is required. As a result, the optimizer will attempt to test col2 first.

You can help the optimizer take advantage of indexes by using the following guidelines.

Try to compare columns that have the same type. When you use indexed columns in comparisons, use columns that are of the same type. For example, CHAR(10) is considered the same as CHAR(10) or VARCHAR(10) but different than CHAR(12) or VARCHAR(12). INT is different than BIGINT. Using columns of the same type is a requirement prior to MySQL 3.23, or indexes on the columns will not be used. From 3.23 on, this is not strictly necessary, but identical column types will still give you better performance than dissimilar types. If the columns you're comparing are of different types, you can use ALTER TABLE to modify one of them so that the types match.

Try to make indexed columns stand alone in comparison expressions. If you use a column in a function call or as part of a more complex term in an arithmetic expression, MySQL can't use the index because it must compute the value of the expression for every row. Sometimes this is unavoidable, but many times you can rewrite a query to get the indexed column to appear by itself.

The following WHERE clauses illustrate how this works. They are equivalent arithmetically, but quite different for optimization purposes. For the first line, the optimizer will simplify the expression 4/2 to the value 2 and then use an index on mycol to quickly find values less than 2. For the second expression, MySQL must retrieve the value of mycol for each row, multiply by 2, and then compare the result to 4. In this case, no index can be used, because each value in the column must be retrieved so that the expression on the left side of the comparison can be evaluated:

 WHERE mycol < 4 / 2  WHERE mycol * 2 < 4

Let's consider another example. Suppose you have an indexed column date_col. If you issue a query such as the following, the index isn't used:

 SELECT * FROM mytbl WHERE YEAR(date_col) < 1990;

The expression doesn't compare an indexed column to 1990; it compares a value calculated from the column value, and that value must be computed for each row. As a result, the index on date_col is not used because performing the query requires a full table scan. What's the fix? Just use a literal date, and the index on date_col can be used to find matching values in the columns:

 WHERE date_col < '1990-01-01'

But suppose you don't have a specific date. You might be interested instead in finding records that have a date that lies within a certain number of days from today. There are several ways to express a comparison of this type not all of which are equally good. Three possibilities are as follows:

 WHERE TO_DAYS(date_col) - TO_DAYS(CURDATE()) < cutoff  WHERE TO_DAYS(date_col) < cutoff + TO_DAYS(CURDATE()) WHERE date_col < DATE_ADD(CURDATE(), INTERVAL cutoff DAY)

For the first line, no index is used because the column must be retrieved for each row so that the value of TO_DAYS(date_col) can be computed. The second line is better. Both cutoff and TO_DAYS(CURDATE()) are constants, so the right hand side of the comparison can be calculated by the optimizer once before processing the query, rather than once per row. But the date_col column still appears in a function call, so the index isn't used. The third line is best of all. Again, the right side of the comparison can be computed once as a constant before executing the query, but now the value is a date. That value can be compared directly to date_col values, which no longer need to be converted to days. In this case, the index can be used.

Don't use wildcards at the beginning of a LIKE pattern. Sometimes people search for strings using a WHERE clause of the following form:

 WHERE col_name LIKE '%string%'

That's the correct thing to do if you want to find string no matter where it occurs in the column. But don't put '%' on both sides of the string simply out of habit. If you're really looking for the string only when it occurs at the beginning of the column, leave out the first '%'. Suppose you're looking in a column containing last names for names like MacGregor or MacDougall that begin with 'Mac'. In that case, write the WHERE clause like this:

 WHERE last_name LIKE 'Mac%'

The optimizer looks at the literal initial part of the pattern and uses the index to find rows that match as though you'd written the following expression, which is in a form that allows an index on last_name to be used:

 WHERE last_name >= 'Mac' AND last_name < 'Mad'

This optimization does not apply to pattern matches that use the REGEXP operator.

Help the optimizer make better estimates about index effectiveness. By default, when you are comparing values in indexed columns to a constant, the optimizer assumes that key values are distributed evenly within the index. The optimizer will also do a quick check of the index to estimate how many entries will be used when determining whether or not the index should be used for constant comparisons. For MyISAM and BDB tables, you can tell the server to perform an analysis of key values by using ANALYZE TABLE. This provides the optimizer with better information. Another option, for MyISAM tables, is to run myisamchk --analyze (or isamchk --analyze for ISAM tables). These utilities operate directly on the table files, so two conditions must be satisfied in order to use them for key analysis:

You must have an account on the MySQL server host that allows you write access to the table files.
You must cooperate with the server for access to the table files, because you don't want it to be accessing the table while you're working with its files. (Protocols for coordinating table access with the server are described in Chapter 13, "Database Backups, Maintenance, and Repair." Use the protocol that is appropriate for write access.)

Use EXPLAIN to verify optimizer operation. Check to see that indexes are being used in your query to reject rows quickly. If not, you might try using STRAIGHT_JOIN to force a join to be done using tables in a particular order. (Run the query both with and without STRAIGHT_JOIN; MySQL may have some good reason not to use indexes in the order you think is best.) As of MySQL 3.23.12, you can also try USE INDEX or IGNORE INDEX to give the server hints about which indexes to prefer.

Test alternate forms of queries, but run them more than once. When testing alternate forms of a query, run it several times each way. If you run a query only once each of two different ways, you'll often find that the second query is faster just because information from the first query is still in the disk cache and need not actually be read from the disk. You should also try to run queries when the system load is relatively stable to avoid effects due to other activities on your system.

Avoid overuse of MySQL's automatic type conversion. MySQL will perform automatic type conversion, but if you can avoid conversions, you may get better performance. For example, if num_col is an integer column, the following two queries both will return the same result:

 SELECT * FROM mytbl WHERE num_col = 4;  SELECT * FROM mytbl WHERE num_col = '4';

But the second query involves a type conversion. The conversion operation itself involves a small performance penalty for converting the integer and string to double to perform the comparison. A more serious problem is that if num_col is indexed, a comparison that involves type conversion may prevent the index from being used.

Overriding Optimization

It sounds odd, but there may be times when you'll want to defeat MySQL's optimization behavior. Some of the reasons to do this are described in the following list:

To empty a table with minimal side effects. When you need to empty a table completely, it's fastest to have the server just drop the table and re-create it based on the description stored in its .frm file. To do this, use a TRUNCATE TABLE statement:

 TRUNCATE TABLE tbl_name;

Prior to MySQL 4, you can achieve the same effect by using a DELETE statement with no WHERE clause:

 DELETE FROM tbl_name;

The server's optimization of emptying a table by re-creating it from scratch makes the operation extremely fast because each row need not be deleted individually. However, there are some side effects that may be undesirable under certain circumstances:

Prior to MySQL 4, DELETE with no WHERE clause may report the number of rows affected as zero, even when the table wasn't empty. TRUNCATE TABLE may do this for any version of MySQL, depending on the table type. Most of the time this doesn't matter, although it can be puzzling if you don't expect it. But for applications that require an accurate count of the number of deleted rows, a count of zero is not acceptable.
For MyISAM tables, AUTO_INCREMENT values normally are not reused when rows are deleted. (See Chapter 2, "Working with Data in MySQL.") However, emptying a table by re-creating it may reset the sequence to begin over at 1.

If you encounter these side effects and want to avoid them, use an "unoptimized" full-table DELETE statement that includes a trivially true WHERE clause:

 DELETE FROM tbl_name WHERE 1;

Adding the WHERE clause forces MySQL to do a row-by-row deletion, because it must evaluate the condition for each row to determine whether or not to delete it. The query executes much more slowly, but it will return the true number of rows deleted, and it will preserve the current AUTO_INCREMENT sequence number for MyISAM tables.

To override the optimizer's table join order. Use STRAIGHT_JOIN to force the optimizer to use tables in a particular order. If you do this, you should order the tables so that the first table is the one from which the smallest number of rows will be chosen. (If you are not sure which table this is, put the table with the most rows first.) In other words, try to order the tables to cause the most restrictive selection to come first. Queries perform better the earlier you can narrow the possible candidate rows. Make sure to try the query both ways; there may be some reason the optimizer isn't joining tables the way you think it should, and STRAIGHT_JOIN may not actually help.

Another possibility is to use the USE INDEX and IGNORE INDEX modifiers after a table name in the table list of a join to tell MySQL to use or ignore indexes. This may be helpful in cases where the optimizer doesn't make the correct choice.

To retrieve results in random order. As of MySQL 3.23.2, you can use ORDER BY RAND() to sort results randomly. Another technique, which is useful for older versions of MySQL, is to select a column of random numbers and sort on that column. However, if you try writing the query as follows, the optimizer defeats your intent:

 SELECT ..., RAND() as rand_col FROM ... ORDER BY rand_col;

The problem here is that MySQL sees that the column is a function call, thinks that the value of the column will be a constant, and optimizes the ORDER BY clause right out of the query! You can fool the optimizer by referring to a table column in the expression. For example, if your table has a column named age, you can write the query as follows:

 SELECT ..., age*0+RAND() as rand_col FROM ... ORDER BY rand_col;

In this case, the expression value is always equivalent to RAND(). But the optimizer doesn't know that, so it no longer guesses that the column contains a constant value in each row.

To avoid an endless update loop. Prior to MySQL 3.23.2, if you update a column that is indexed, it's possible for the rows that are updated to be updated endlessly if the column is used in the WHERE clause and the update moves the index value into the part of the range that hasn't been processed yet. Suppose the mytbl table has an integer column key_col that is indexed. Queries such as the following can cause problems:

 UPDATE mytbl SET key_col = key_col+1 WHERE key_col > 0;

The solution for this is to use key_col in an expression term in the WHERE clause such that MySQL can't use the index:

 UPDATE mytbl SET key_col = key_col+1 WHERE key_col+0 > 0;