Creating Useful Indexes | Inside Microsoft SQL Server 7.0 (Mps)

Creating useful indexes is one of the most important tasks you can do to achieve good performance. Indexes can dramatically speed up data retrieval and selection, but they are a drag on data modification because along with changes to the data, the index entries must also be maintained and those changes must be logged. The key to creating useful indexes is understanding the uses of the data, the types and frequencies of queries performed, and how queries can use indexes to help SQL Server find your data quickly. A CRUD chart or similar analysis technique can be invaluable in this effort. You might want to quickly review the difference between clustered and nonclustered indexes because the difference is crucial in deciding what kind of index to create.

Clustered and nonclustered indexes are similar at the upper (node) levels ”both are organized as B-trees. Index rows above the leaf level contain index key values and pointers to pages the next level down. Each row keeps track of the first key value on the page it points to. Figure 14-1 shows an abstract view of an index node for an index on a customer's last name . The entry Johnson indicates page 1:200 (file 1, page 200), which is at the next level of the index. Since Johnson and Jones are consecutive entries, all the entries on page 1:200 have values between Johnson (inclusive) and Jones (exclusive).

Figure 14-1. An index node page.

The leaf, or bottom, level of the index is where clustered and nonclustered indexes differ . For both kinds of indexes, the leaf level contains every key value in the table on which the index is built, and those keys are in sorted order. In a clustered index, the leaf level is the data level, so of course every key value is present. This means that the data in a table is sorted in order of the clustered index. In a nonclustered index, the leaf level is separate from the data. In addition to the key values, the index rows contain a bookmark indicating where to find the actual data. If the table has a clustered index, the bookmark is the clustered index key that corresponds to the nonclustered key in the row. (If the clustered key is composite, all parts of the key are included.)

Remember that clustered indexes are guaranteed to be unique in SQL Server 7; if you don't declare them as unique, SQL Server adds a uniqueifier to every duplicate key to turn the index into a unique composite index. If our index on last name is a nonclustered index and the clustered index on the table is the zip code, a leaf-level index page might look something like Figure 14-2, below. The number in parentheses after the zip code is the uniqueifier and appears only when there are duplicate zip codes.

Figure 14-2. A leaf-level index page.

Choose the Clustered Index Carefully

Clustered indexes are extremely useful for range queries (for example, WHERE sales_quantity BETWEEN 500 and 1000 ) and for queries in which the data must be ordered to match the clustering key. Only one clustered index can exist per table, since it defines the physical ordering of the data for that table. Since you can have only one clustered index per table, you should choose it carefully based on the most critical retrieval operations. Because of the clustered index's role in managing space within the table, nearly every table should have one. And if a table has only one index, it should probably be clustered.

If a table is declared with a primary key (which is advisable), by default the primary key columns form the clustered index. Again, this is because almost every table should have a clustered index, and if the table has only one index, it should probably be clustered. But if your table has several indexes, some other index might better serve as the clustered index. This is often true when you do single-row retrieval by primary key. A nonclustered, unique index works nearly as well in this case and still enforces the primary key's uniqueness. So save your clustered index for something that will benefit more from it by adding the keyword NONCLUSTERED when you declare the PRIMARY KEY constraint.

Make Nonclustered Indexes Highly Selective

A query using an index on a large table is often dramatically faster than a query doing a table scan. But this is not always true, and table scans are not all inherently evil. Nonclustered index retrieval means reading B-tree entries to determine the data page that is pointed to and then retrieving the page, going back to the B-tree, retrieving another data page, and so on until many data pages are read over and over. (Subsequent retrievals can be from cache.) With a table scan, the pages are read only once. If the index does not disqualify a large percentage of the rows, it is cheaper to simply scan the data pages, reading every page exactly once.

The query optimizer greatly favors clustered indexes over nonclustered indexes, because in scanning a clustered index the system is already scanning the data pages. Once it is at the leaf of the index, the system has gotten the data as well. So there is no need to read the B-tree, read the data page, and so on. This is why nonclustered indexes must be able to eliminate a large percentage of rows to be useful (that is, they must be highly selective), whereas clustered indexes are useful even with less selectivity.

Indexing on columns used in the WHERE clause of frequent or critical queries is often a big win, but this usually depends on how selective the index is likely to be. For example, if a query has the clause WHERE last_name = 'Stankowski' , an index on last_name is likely to be very useful; it can probably eliminate 99.9 percent of the rows from consideration. On the other hand, a nonclustered index will probably not be useful on a clause of WHERE sex = 'M' because it eliminates only about half of the rows from consideration; the repeated steps needed to read the B-tree entries just to read the data require far more I/O operations than simply making one single scan through all the data. So nonclustered indexes are typically not useful on columns that do not have a wide dispersion of values.

Think of selectivity as the percentage of qualifying rows in the table (qualifying rows/total rows). If the ratio of qualifying rows to total rows is low, the index is highly selective and is most useful. If the index is used, it can eliminate most of the rows in the table from consideration and greatly reduce the work that must be performed. If the ratio of qualifying rows to total rows is high, the index has poor selectivity and will not be useful. A nonclustered index is most useful when the ratio is around 5 percent or less ”that is, if the index can eliminate 95 percent of the rows from consideration. If the index has less than 5 percent selectivity, it probably will not be used; either a different index will be chosen or the table will be scanned. Recall that each index has a histogram of sampled data values for the index key, which the optimizer uses to estimate whether the index is selective enough to be useful to the query.

Tailor Indexes to Critical Transactions

Indexes speed data retrieval at the cost of additional work for data modification. To determine a reasonable number of indexes, you must consider the frequency of updates vs. retrievals and the relative importance of the competing types of work. If your system is almost purely a decision-support system (DSS) with little update activity, it makes sense to have as many indexes as will be useful to the queries being issued. A DSS might reasonably have a dozen or more indexes on a single table. If you have a predominantly online transaction processing (OLTP) application, you need relatively few indexes on a table ”probably just a couple carefully chosen ones.

Look for opportunities to achieve index coverage in queries, but don't get carried away. An index "covers" the query if it has all the data values needed as part of the index key. For example, if you have a query such as SELECT emp_name, emp_sex from employee WHERE emp_name LIKE 'Sm%' and you have a nonclustered index on emp_name , it might make sense to append the emp_sex column to the index key as well. Then the index will still be useful for the selection, but it will already have the value for emp_sex . The optimizer won't need to read the data page for the row to get the emp_sex value; the optimizer is smart enough to simply get the value from the B-tree key. The emp_sex column is probably a char(1) , so the column doesn't add greatly to the key length, and this is good.

Every nonclustered index is a covering index if all you are interested in is the key column of the index. For example, if you have a nonclustered index on first name, it covers all these queries:

Select all the first names that begin with K .
Find the first name that occurs most often.
Determine whether the table contains the name Melissa .

In addition, if the table also has a clustered index, every nonclustered index includes the clustering key. So it can also cover any queries that need the clustered key value in addition to the nonclustered key. For example, if our nonclustered index is on the first name and the table has a clustered index on the last name, the following queries can all be satisfied by accessing only leaf pages of the B-tree:

Select Tibor's last name.
Determine whether any duplicate first and last name combinations exist.
Find the most common first name for people with the last name Wong .

You can go too far and add all types of fields to the index. The net effect is that the index becomes a virtual copy of the table, just organized differently. Far fewer index entries fit on a page, I/O increases , cache efficiency is reduced, and much more disk space is required. The covered queries technique can improve performance in some cases, but you should use it with discretion.

A unique index (whether nonclustered or clustered) offers the greatest selectivity (that is, only one row can match), so it is most useful for queries that are intended to return exactly one row. Nonclustered indexes are great for single-row accesses via the PRIMARY KEY or UNIQUE constraint values in the WHERE clause.

Indexes are also important for data modifications, not just for queries. They can speed data retrieval for selecting rows, and they can speed data retrieval needed to find the rows that must be modified. In fact, if no useful index for such operations exists, the only alternative is for SQL Server to scan the table to look for qualifying rows. Update or delete operations on only one row are common; you should do these operations using the primary key (or other UNIQUE constraint index) values to be assured that there is a useful index to that row and no others.

A need to update indexed columns can affect the update strategy chosen. For example, to update a column that is part of the key of the clustered index on a table, you must process the update as a delete followed by an insert rather than as an update-in-place. When you decide which columns to index, especially which columns to make part of the clustered index, consider the effects the index will have on the update method used. (Review the discussion of updates in Chapter 8.)

Pay Attention to Column Order

At the risk of stating the obvious, an index can be useful to a query only if the criteria of the query match the columns that are leftmost in the index key. For example, if an index has a composite key of last_name,first_name , that index is useful for a query such as WHERE last_name = 'Smith' or WHERE last_name = 'Smith' AND first_name = 'John' . But it is not useful for a query such as WHERE first_name = 'John' . Think of using the index like a phone book. You use a phone book as an index on last name to find the corresponding phone number. But the standard phone book is useless if you know only a person's first name because the first name might be located on any page.

Put the most selective columns leftmost in the key of nonclustered indexes. For example, an index on emp_name,emp_sex is useful for a clause such as WHERE emp_name = 'Smith' AND emp_sex = 'M' . But if the index is defined as emp_sex,emp_name , it isn't useful for most retrievals. The leftmost key, emp_sex , cannot rule out enough rows to make the index useful. Be especially aware of this when it comes to indexes that are built to enforce a PRIMARY KEY or UNIQUE constraint defined on multiple columns. The index is built in the order that the columns are defined for the constraint. So you should adjust the order of the columns in the constraint to make the index most useful to queries; doing so will not affect its role in enforcing uniqueness.

Index Columns Used in Joins

Index columns are frequently used to join tables. When you create a PRIMARY KEY or UNIQUE constraint, an index is automatically created for you. But no index is automatically created for the referencing columns in a FOREIGN KEY constraint. Such columns are frequently used to join tables, so they are almost always among the most likely ones on which to create an index. If your primary key and foreign key columns are not naturally compact, consider creating a surrogate key using an identity column (or a similar technique). As with row length for tables, if you can keep your index keys compact, you can fit many more keys on a given page, which results in less physical I/O and better cache efficiency. And if you can join tables based on integer values such as an identity, you avoid having to do relatively expensive character-by-character comparisons. Ideally, columns used to join tables are integer columns ”fast and compact.

Join density is the average number of rows in one table that match a row in the table it is being joined to. You can also think of density as the average number of duplicates for an index key. A column with a unique index has the lowest possible density (there can be no duplicates) and is therefore extremely selective for the join. If a column being joined has a large number of duplicates, it has a high density and is not very selective for joins.

Joins are frequently processed as nested loops . For example, if while joining the orders table with order_items the system starts with the orders table (the outer table) and then for each qualifying order row, the inner table is searched for corresponding rows. Think of the join being processed as, "Given a specific row in the outer table, go find all corresponding rows in the inner table." If you think of joins in this way, you'll realize that it is important to have a useful index on the inner table, which is the one being searched for a specific value. For the most common type of join, an equijoin that looks for equal values in columns of two tables, the optimizer automatically decides which is the inner table and which is the outer table of a join. The table order that you specify for the join doesn't matter in the equijoin case. However, the order for outer joins must match the semantics of the query, so the resulting order is dependent on the order specified. We'll talk about join strategies later in this chapter.

Create or Drop Indexes as Needed

If you create indexes but find that they aren't used, you should drop them. Unused indexes slow data modification without helping retrieval. You can determine whether indexes are used by watching the plans produced via the SHOWPLAN options; this is easy if you are analyzing a large system with many tables and indexes. There might be thousands of queries that can be run and no way to run and analyze the SHOWPLAN output for all of them. An alternative is to use the Index Tuning Wizard to generate a report of current usage patterns. The wizard is designed to determine which new indexes to build, but you can use it simply as a reporting tool to find out what is happening in your current system. We'll look at the wizard later in this chapter.

Some batch-oriented processes that are query intensive can benefit from certain indexes. Such processes as complex reports or end-of-quarter financial closings often run infrequently. If this is the case, remember that creating and dropping indexes is simple. Consider a strategy of creating certain indexes in advance of your batch processes and then dropping them when those batch processes are done. In this way, the batch processes benefit from the indexes but do not add overhead to your OLTP usage.