Indexing is the most important tool you have for speeding up queries. There are other techniques available to you, too, but generally the one thing that will make the most difference is the proper use of indexes. On the MySQL mailing list, people often ask for help in making a query run faster. In a surprisingly large number of cases, there are no indexes on the tables in question, and adding indexes often solves the problem immediately. It doesn't always work like that, because optimization isn't always simple. Nevertheless, if you don't use indexes, in many cases you're just wasting your time trying to improve performance by other means. Use indexing first to get the biggest performance boost and then see what other techniques might be helpful. This section describes what an index is and how indexing improves query performance. It also discusses the circumstances under which indexes might degrade performance and provides guidelines for choosing indexes for your table wisely. In the next section, we'll discuss MySQL's query optimizer. It's good to have some understanding of the optimizer in addition to knowing how to create indexes because then you'll be better able to take advantage of the indexes you create. Certain ways of writing queries actually prevent your indexes from being useful, and generally you'll want to avoid having that happen. (Not always, though. Sometimes you'll want to override the optimizer's behavior. We'll cover some of those cases, too.) Benefits of IndexingLet's consider how an index works by beginning with a table that has no indexes. An unindexed table is simply an unordered collection of rows. For example, Figure 4.1 shows the ad table that we first saw in Chapter 1, "Getting Started with MySQL and SQL." There are no indexes on this table, so to find the rows for a particular company, it's necessary to examine each row in the table to see if it matches the desired value. This involves a full table scan, which is slow as well as tremendously inefficient if the table is large but contains only a few records matching the search criteria. Figure 4.1. Unindexed ad table.Figure 4.2 shows the same table but with the addition of an index on the company_num column in the ad table. The index contains an entry for each row in the ad table, but the index entries are sorted by company_num value. Now, instead of searching through the table row by row looking for items that match, we can use the index. Suppose we're looking for all rows for company 13. We begin scanning the index and find three rows for that company. Then we reach the row for company 14, a value higher than the one we're looking for. Index values are sorted, so when we read the record containing 14, we know we won't find any more matches and can quit looking. Thus, one efficiency gained by using the index is that we can tell where the matching rows end and can skip the rest. Another efficiency is that there are positioning algorithms for finding the first matching entry without doing a linear scan from the start of the index (for example, a binary search is much quicker than a scan). That way, we can quickly position to the first matching value and save a lot of time in the search. Databases use various techniques for positioning to index values quickly, but it's not so important here what those techniques are. What's important is that they work and that indexing is a good thing. Figure 4.2. Indexed ad table.You may be asking why we don't just sort the data file and dispense with the index file. Wouldn't that produce the same type of improvement in search speed? Yes, it would--if the table had a single index. But you might want to add a second index, and you can't sort the data file two different ways at once. (For example, you might want one index on customer names and another on customer ID numbers or phone numbers.) Using indexes as entities separate from the data file solves the problem and allows multiple indexes to be created. In addition, rows in the index are generally shorter than data rows. When you insert or delete new values, it's easier to move around shorter index values to maintain the sort order than to move around the longer data rows. The example just described corresponds in general to the way MySQL indexes tables, although the particular details vary for different table types. For example, for a MyISAM or ISAM table, the table's data rows are kept in a data file, and index values are kept in an index file. You can have more than one index on a table; if you do, they're all stored in the same index file. Each index in the index file consists of a sorted array of key records that are used for fast access into the data file. By contrast, the BDB and InnoDB table handlers do not separate data rows and index values in the same way, although both maintain indexes as sets of sorted values. The BDB handler uses a single file per table to store both data and index values, and the InnoDB handler uses a single tablespace within which it manages data and index storage for all InnoDB tables. The preceding discussion describes the benefit of an index in the context of single-table queries, where the use of an index speeds searches significantly by eliminating the need for full table scans. However, indexes are even more valuable when you're running queries involving joins on multiple tables. In a single-table query, the number of values you need to examine per column is the number of rows in the table. In a multiple-table query, the number of possible combinations skyrockets because it's the product of the number of rows in the tables. Suppose you have three unindexed tables, t1, t2, and t3, each containing a column c1, c2, and c3, respectively, and each consisting of 1000 rows that contain the numbers 1 through 1000. A query to find all combinations of table rows in which the values are equal looks like this: SELECT t1.c1, t2.c2, t3.c3 FROM t1, t2, t3 WHERE t1.c1 = t2.c2 AND t1.c1 = t3.c3; The result of this query should be 1000 rows, each containing three equal values. If we process the query in the absence of indexes, we have no idea which rows contain which values. Consequently, we must try all combinations to find the ones that match the WHERE clause. The number of possible combinations is 1000x1000x1000 (1 billion!), which is a million times more than the number of matches. That's a lot of wasted effort, and this query is likely to be very slow, even for a database such as MySQL that is very fast. And that is with only 1000 rows per table. What happens when you have tables with millions of rows? As tables grow, the time to process joins on those tables grows even more if no indexes are used, leading to very poor performance. If we index each table, we can speed things up considerably because indexing allows the query to be processed as follows:
In this case, we're still performing a full scan of table t1, but we're able to do indexed lookups on t2 and t3 to pull out rows from those tables directly. The query runs about a million times faster this way literally. (This example is contrived for the purpose of making a point, of course. Nevertheless, the problems it illustrates are real, and adding indexes to tables that have none often results in dramatic performance gains.) MySQL uses indexes as just described to speed up searches for rows matching terms of a WHERE clause or rows that match rows in other tables when performing joins. It also uses indexes to improve the performance of other types of operations:
Disadvantages of IndexingIn general, if MySQL can figure out how to use an index to process a query more quickly, it will. This means that, for the most part, if you don't index your tables, you're hurting yourself. You can see that I'm painting a rosy picture of the benefits of indexing. Are there disadvantages? Yes, there are. In practice, these drawbacks tend to be outweighed by the advantages, but you should know what they are. First, an index takes up disk space, and multiple indexes take up correspondingly more space. This may cause you to reach a table size limit more quickly than if there are no indexes:
Second, indexes speed up retrievals but slow down inserts and deletes as well as updates of values in indexed columns. That is, indexes slow down most operations involving writing. This occurs because writing a record requires writing not only the data row, it requires changes to any indexes as well. The more indexes a table has, the more changes need to be made, and the greater the average performance degradation. In the "Loading Data Efficiently" section later in this chapter, we'll go into more detail about this phenomenon and what you can do about it. Choosing IndexesThe syntax for creating indexes was covered in the "Creating and Dropping Indexes" section of Chapter 3, "MySQL SQL Syntax and Use." I assume here that you've read that section. But knowing syntax doesn't in itself help you determine how your tables should be indexed. That requires some thought about the way you use your tables. This section gives some guidelines on how to identify candidate columns for indexing and how best to set up indexes:
|