Indexing Scenarios

< Day Day Up >

This section assesses an array of circumstances in which indexing can add impressive speed to your MySQL-based applications.

`PRIMARY KEY`

As you saw earlier, a primary key is defined as one or more column(s) that uniquely identify every row in a table. MySQL can use this information as a shortcut to finding the row with as little effort as possible, as well as a safeguard against duplicate data.

With very few exceptions, all of your tables should have primary keys. If you can't identify one or more columns that can serve as a primary key, it's a good idea to simply create a single numeric column and let MySQL populate it with unique data:

 CREATE TABLE pk_demo (     id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,     col1 VARCHAR(10),     ...     ... );

In the preceding example, id serves as the primary key. MySQL creates a unique index for this column, and prevents duplicate entries:

 INSERT INTO pk_demo VALUES (1, 'Original'); INSERT INTO pk_demo VALUES (1, 'Duplicate'); ERROR 1062 (23000): Duplicate entry '1' for key 1

Each time you insert a row, MySQL increments this value by one. You should note the following two points about system-generated primary keys:

You have control over the starting point of this value. As of version 5.0.3, MyISAM and InnoDB both support the AUTO_INCREMENT directive, which lets you specify the first number in your sequence. For example, the next SQL statement will start the sequence at 1,000:
```
 CREATE TABLE auto_inc_demo (     col1 INTEGER PRIMARY KEY AUTO_INCREMENT,     col2 CHAR(10) )  AUTO_INCREMENT = 1000 ENGINE = INNODB; 
```
Don't make the system-generated primary key field any larger than necessary, but be certain you allocate enough room, especially if joins are involved. For example, if you define one table's primary key as SMALLINT and another's as TINYINT(2), the latter table can only hold 255 rows; additional row insert attempts receive a duplicate key error, and joins might be difficult between the two tables.

Finally, what if you can identify one or more columns that could serve as a primary key, but these columns are large and/or not numeric? In this case, you should probably still create a primary key and let MySQL fill it in with unique values. These values will come in handy when you try to speed the joins of this table to others; the benefits of numeric versus nonnumeric joins are discussed later. If you still want to create a multifield primary key, use the following syntax:

 CREATE TABLE pk_demo_multi (     pk_field1 INT,     pk_field2 VARCHAR(10),     ...     ...     PRIMARY KEY (pk_field1, pk_field2) );

Note that when your primary key is made up of multiple columns, it is the combination that must be unique; there can be many duplicate values in each column as long as the whole key is unique.

Filter Columns

Filter columns help speed the results of your database operations by reducing the number of potential rows that MySQL must process to satisfy your request. In the absence of an index, MySQL must perform a table scan to look at each row to see if it matches your criteria. On a large table, this can take a tremendous amount of time. To make matters worse, these costs are borne by statements in addition to SELECT, such as UPDATE and DELETE.

As a simple example, suppose that you have two sample tables defined as follows:

 CREATE TABLE sample_customers (     customer_id INT,     last_name VARCHAR(30),     first_name VARCHAR(30),     city VARCHAR(30) ); CREATE TABLE city_customer_count (     city VARCHAR(30),     customer_count INT );

This table will likely be very large, holding many millions of rows. Users are expected to submit scores of queries that filter on the last_name, as well as numerous updates that find rows for a particular city and then revise related records in other tables:

 SELECT customer_id, last_name, first_name FROM sample_customers sc WHERE last_name = 'Lebowski'; UPDATE city_customer_count ccc SET customer_count = (     SELECT COUNT(*) FROM sample_customers     WHERE city = 'Personville' ) WHERE ccc.city = 'Personville';

These types of operations can take a very long time to complete when there are no indexes in place on filter columns. Luckily, it's very simple to place indexes on these frequently filtered columns; dramatic performance improvements usually follow these kinds of enhancements:

 CREATE INDEX sc_ix1 ON sample_customers(last_name); CREATE INDEX sc_ix2 ON sample_customers(city); CREATE INDEX ccc_ix1 ON city_customer_count(city);

Both of these operations should run much more quickly now. MySQL can use the new indexes to rapidly locate the correct rows in either table.

It's important to understand that no index, including those on filters, is free. It's true that indexes consume additional disk space, but their true cost is often measured in the extra amount of time it takes for the database engine to update these indexes whenever a change (new row, updated row, deleted row) is made to the table. Over-indexed tables translate into slower data modifications, so be mindful of the benefits and costs of these crucial database structures.

Join Columns

When you join information between two or more tables, MySQL looks for any available indexes to help it locate the correct set of rows. In the absence of any indexes on join columns, MySQL is often forced to perform an expensive table scan of every row in an attempt to complete the join and locate your answer. This means that wherever possible, you should place indexes on join columns.

For example, suppose that you want to run a query to find all customers who joined the frequent flyer program during the month of December 2000 who have also redeemed an award during July 2006. First, take a look at the two tables in question:

 CREATE TABLE customer_master (     customer_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,     ff_number CHAR(10),     last_name VARCHAR(50) NOT NULL,     first_name VARCHAR(50) NOT NULL,     home_phone VARCHAR(20),     mobile_phone VARCHAR(20),     fax VARCHAR(20),     email VARCHAR(40),     home_airport_code CHAR(3),     date_of_birth DATE,     sex ENUM ('M','F'),     date_joined_program DATE,     date_last_flew DATETIME ) ENGINE = INNODB; CREATE TABLE customer_awards (     award_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,     ff_number CHAR(10) NOT NULL,     award_type_code TINYINT(2) NOT NULL,     mileage_redeemed SMALLINT NOT NULL,     date_redeemed DATE NOT NULL ) ENGINE = MYISAM;

Next, the following is the query to locate this information:

 SELECT cm.ff_number, cm.last_name, cm.first_name, ca.award_type_code, ca.date_redeemed FROM customer_master cm INNER JOIN customer_awards ca ON cm.ff_number = ca.ff_number AND cm.date_joined_program BETWEEN '2000-12-01' AND '2000-12-31' AND ca.date_redeemed BETWEEN '2006-07-01' AND '2006-07-31';

Several problems with these tables were defined previously. Of immediate concern in this section is the lack of an index on the ff_number column in the customer_awards; the importance of indexing filter columns was discussed earlier in this chapter.

No matter what else happens in the query, this missing join index might translate into a table scan to find rows to join to the customer_master table. Placing an index on this column is easy and can have a dramatic effect on query performance.

You should note that placing an index on a join column is not a magic bullet: In its absence, MySQL might well determine a different but equally valid, fast query plan. However, it's still a good idea to help the database engine by indexing these kinds of columns just be aware that they might not always solve all performance problems.

Composite indexes (that is, indexes that are made up of more than one column) are also a potential solution here. These indexes are discussed later in this chapter.

Finally, you can tune several server parameters to help boost join performance. These include join_buffer_size and max_join_size. All performance-related server parameters are reviewed later in the book.

Index Cardinality

As you saw earlier, MySQL's SHOW INDEX command returns details about the cardinality, or uniqueness, of the columns that make up an index. You should strive to make these values as high as possible to avoid highly duplicate indexes.

A highly duplicate index can be thought of as an index that only has a handful of potential key values. In many cases, it's worse to have a highly duplicate index than to have no index at all.

For example, suppose you added the following indexes to the customer_master example from earlier in this chapter:

 CREATE INDEX cm_ix1 ON customer_master (last_name, first_name); CREATE INDEX cm_ix2 ON customer_master (date_joined_program); CREATE INDEX cm_ix3 ON customer_master (sex);

The first two indexes make sense: They contain keys that are likely to be quite distinct. However, because the sex column will only contain either 'M' or 'F,' the third index will be highly duplicated.

When you add a row into a table with a highly duplicate index, the engine must work through the myriad of index pages containing the highly duplicate entry to find the correct place to register the new row. This degrades performance by requiring additional I/O with minimal benefit to your application. For filtering or joining, it's probably just as simple for MySQL to use a table scan because the index is so nonselective.

If you find yourself with a highly duplicate index and you still need to search or sort on the value, consider creating a new multicolumn index that combines the highly duplicate column with a more unique column. Remember that to correctly take advantage of a multicolumn index, your database operation must always include the leftmost column(s). This is discussed in more detail a little later in this chapter.

Hash Columns

Although a unique index or primary key is the best way to create a unique value, another approach to managing low cardinality situations is to take advantage of the MD5 hashing function to create a higher cardinality value from the concatenation of two or more columns, and then index this column.

For example, suppose that High-Hat Airways' catering department wants to maintain a system to track the inventory of all of their meals over time. In keeping with their ruthless cost-cutting policies, only a few different meal combinations (composed of type and size) are available, yet these two not-very-unique values will form the primary lookup criteria for queries. However, by setting aside and maintaining a hash column, the designers of the system make it easier to use an index to find the correct rows. The main table looks like this:

 CREATE TABLE meal_choices (     meal_date DATE NOT NULL,     meal_count SMALLINT(5) NOT NULL,     meal_type CHAR(20) NOT NULL,     meal_size CHAR(20) NOT NULL,     meal_hash_code CHAR(32),     INDEX (meal_hash_code) );

Now, when new rows are inserted to this table, the INSERT statement looks like this:

 INSERT INTO meal_choices VALUES ( '2006-06-10',250,'Vegetarian','Large',MD5(concat('Vegetarian','Large')) );

The meal_hash_code column could also be periodically populated via an UPDATE statement. In any case, the MD5 function is of use when querying the table because it will take advantage of the index:

 SELECT * FROM meal_choices WHERE meal_hash_code = MD5(concat('Low sodium','Small'));

The benefits of hashing only go so far: It would not be possible, for example, to utilize the index for a range query. The goal of this example was simply to highlight hash lookups and demonstrate how to use them.

Character Versus Numeric Indexes

When compared with character-based indexes, numeric indexes often offer faster access to information. As you analyze your tables, see if you can find a character-based, indexed column that contains exclusively numeric information. If you discover such a column, consider dropping the index and then changing the column's type to the following:

TINYINT if the numeric value for this column will never surpass 255 UNSIGNED, or will range between -128 and +128 if SIGNED
SMALLINT(2) if the numeric value will never surpass 65535 UNSIGNED, or will range between -32,767 and +32,767 SIGNED
INTEGER, which will consume 4 bytes to hold the data for the column. Values can range up to approximately 4 billion if UNSIGNED, or between approximately -2 billion and +2 billion if SIGNED

If you determine that some of the data includes decimal values, simply set the column type to match the table's information.

For example, we created two very simple tables to test the impact of numeric indexes versus character indexes:

 CREATE TABLE char_test (     col1 char(10),     col2 char(10),     INDEX (col1) ); CREATE TABLE numeric_test (     col1 INT UNSIGNED,     col2 char(10),     INDEX (col1) );

Next, we loaded 500,000 rows of random data into both tables. We kept the numeric values in col1 for both tables less than or equal to 999,999. We then ran a battery of index-activating tests, including SELECT AVG(col1), SELECT MIN(col1), SELECT MAX(col1), SELECT COUNT(*), and simple filtering.

In all cases, operations on the numeric table were faster; in some cases, they executed more rapidly by several orders of magnitude. For a larger table, the results would be even more dramatic. Finally, in addition to saving index and data storage space, our revised strategy also should translate into faster index processing because MySQL will need to read fewer bytes to determine if a row contains the requested value. This will also help matters if the column in question is used for filtering and/or joining.

Finally, if your column does indeed contain a mixture of character and numeric data, and this information follows a consistent pattern, don't give up hope. Perhaps there is a way to separate them into their own columns. This possibility is discussed in the next chapter, which examines strategies for fast SQL.

Multicolumn Indexes

Like most other relational database management systems, MySQL lets you create indexes that span more than one column. These indexes come in handy when trying filter, join, or sort on a set of values contained in different columns.

Using the customer_master example from earlier in the chapter, suppose that you are trying to find a customer with the last_name of "Crane." In the absence of an index on this field, MySQL is forced to scan all rows in the table to find matching values.

What if you add a single index on this column? In this case, MySQL uses the index and jumps directly to all the appropriate records. However, suppose you really want to find a specific customer: "Ed Crane," and High-Hat Airways has many thousands of customers with the last name of "Crane." You need a faster, more selective way to locate this customer. This is an example of how a multicolumn index can greatly reduce the amount of searching MySQL must perform to find your results. Multicolumn indexes are usually much more selective than single-column indexes, even if the values in each column are fairly static.

You can create a multicolumn index as follows:

 CREATE INDEX cm_last_first ON customer_master(last_name, first_name);

You could also specify this index when creating the table:

 CREATE TABLE customer_master (     customer_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,     last_name VARCHAR(30) NOT NULL,     first_name VARCHAR(30) NOT NULL, ... ... INDEX(last_name, first_name) );

After the index is in place, finding "Ed Crane" should be very fast MySQL uses the index as a shortcut to the relevant rows.

Unfortunately, many developers mistakenly introduce SQL that does not take advantage of these indexes. Take a look at a few examples:

Unanchored index access What is wrong with the following query?
```
 SELECT * FROM customer_master WHERE first_name = 'Ed'; 
```
Our index begins with last_name and only then includes first_name; any database access that ignores last_name as a filter/join/sort value will likely render this index unusable. To use the index, you must reference the leftmost column(s).
Leading wildcard in index key Look at this query:
```
 SELECT * FROM CUSTOMER_MASTER WHERE last_name LIKE '%rane' AND first_name = 'Ed'; 
```
This query does indeed reference both indexed columns, but it uses a leading wildcard in the leftmost index column. This also forces a table scan to find appropriate rows. The following query, however, is still able to make some use of the index (via an index scan) to find rows with the correct last_name. Although this is still suboptimal, it is better than a full table scan:
```
 SELECT * FROM CUSTOMER_MASTER WHERE last_name = 'Crane' AND first_name LIKE '%Ed'; 
```
Incorrect order for sorting/grouping Look at the following query:
```
 SELECT * FROM CUSTOMER_MASTER WHERE last_name LIKE 'Crane%' ORDER BY first_name, last_name; 
```
At face value, this query looks efficient. However, there is a problem with the ORDER BY portion of the statement. The index on last_name, first_name does nothing when you try to sort on a different sequence. In this case, MySQL is forced to perform a filesort to order the information correctly. Changing the ORDER BY to match the index yields much better performance. Filesort was discussed as part of Chapter 6's MySQL query optimizer exploration.

Partial Indexes

As you have seen throughout this chapter, indexes are great tools for speeding up database applications. However, they can also impose extra costs. The price for indexing is primarily related to additional storage consumption as well as degraded performance when making modifications that affect one or more index keys.

However, MySQL lets you have it both ways: You can take advantage of the speed gains offered by indexes while also conserving storage and processing resources. The secret is in creating indexes that take only partial values of a column into account. The following sections look at two examples in which this makes sense.

Similar Values

Given that the purpose of an index is to help MySQL quickly locate one or more rows based on a certain criteria, it's a good idea to ensure that the index is as selective as possible. As you saw earlier, a highly duplicate index is often worse than no index at all. But what should you do if you have a CHAR or VARCHAR column that blends fairly unique and static information?

For example, suppose that you have a table of product information:

 CREATE TABLE product_info (     id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,     product_code CHAR(30),     product_catalog_description TEXT, ... ... ) ENGINE = MYISAM;

Our research tells us that the product_code column needs to be indexed, has the following layout, and is most unique in the ten leftmost bytes:

 AAAAA-NNNN-AAAAAAAAAAAAAAAAAA

In other words, the AAAAA-NNNN portion of this column is relatively unique, whereas everything to its right is quite repetitive. In this case, it makes sense to only index that portion of the column that is most unique:

 CREATE INDEX prod_info_left ON product_info(product_code(10));

This index is very selective, yet consumes up to two-thirds less disk resources while decreasing CPU load when the index is updated. Note that you can construct multicolumn indexes using the same restrictions.

`TEXT`/`BLOB` Values

When defining an index for a TEXT or BLOB column, you must specify a prefix that indicates how many bytes you want to be included in the index.

One great way to determine this is to create a sample table that matches the structure of your production table. Next, load a representative subset of production data into the table, create a test index on a portion of the column(s), and then run OPTIMIZE TABLE followed by SHOW INDEX. Keep an eye on the cardinality value for this test index. When it starts dropping significantly, you'll know that you should not make the index any narrower.

Another technique is to use a query to get insight into the ideal prefix size:

 SELECT COUNT(DISTINCT LEFT(column_name, prefix_length)) FROM table_name;

Ascending Versus Descending Indexes

Currently, MySQL only supports ascending indexes, even if you mandate a descending sort in your index generation statement. However, at some point in the future you will be able to create indexes that are made up of a mixture of ascending and descending key values. For example, suppose that you want to track information about employees and their hire dates:

 CREATE TABLE employee (     last_name CHAR(40), ...     date_of_hire DATE, ...     INDEX (last_name ASC, date_of_hire DESC) );

This new functionality will add a great deal of speed and flexibility to queries that return large, sorted sets of data.

Storing Tables in Column-Sorted Order

Earlier in this chapter, you saw how both MyISAM and InnoDB have their own mechanisms to change the physical storage order for rows within a table. However, you have one additional sort option: By using the ALTER TABLE ... ORDER BY statement, you have control over how MySQL stores the physical data for your tables, regardless of whether an index is in place. For example, we created a sample table to hold transaction details:

 CREATE TABLE transactions (     transaction_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,     transaction_date DATETIME NOT NULL,     customer_id INT NOT NULL,     amount DECIMAL (5,2) NOT NULL,     transaction_type ENUM ('Purchase','Credit') ) ENGINE = MYISAM;

Next, we filled it with 100,000 rows of random data, and then created a copy of the table, called TRansactions_sorted. Finally, we sorted the latter table by customer_id:

 ALTER TABLE transactions_sorted ORDER BY customer_id;

Figure 7.6 shows the results of two queries against this information. The top query shows that the original TRansactions table is sorted by the primary key, but the bottom query (showing the altered transactions_sorted table) demonstrates that the table has been restructured in customer_id order.

Figure 7.6. Resultsets from the same query taken from two tables stored in different order.

Although after data is added this table will not remain in this state (unless you rerun the ALTER TABLE statement), there can be usability advantages to storing a table in a meaningful order. If the table had been defined with InnoDB instead of MyISAM, this would have no effect: InnoDB always orders its data by the clustered key.