Creating Useful Indexes | Inside Microsoft SQL Server 2000

Creating useful indexes is one of the most important tasks you can do to achieve good performance. Indexes can dramatically speed up data retrieval and selection, but they are a drag on data modification because along with changes to the data, the index entries must also be maintained and those changes must be logged. The key to creating useful indexes is understanding the uses of the data, the types and frequencies of queries performed, and how queries can use indexes to help SQL Server find your data quickly. A CRUD chart or similar analysis technique can be invaluable in this effort. You might want to quickly review the difference between clustered and nonclustered indexes because the difference is crucial in deciding what kind of index to create.

Clustered and nonclustered indexes are similar at the upper (node) levels—both are organized as B-trees. Index rows above the leaf level contain index key values and pointers to pages the next level down. Each row keeps track of the first key value on the page it points to. Figure 16-1 shows an abstract view of an index node for an index on a customer's last name. The entry Johnson indicates page 1:200 (file 1, page 200), which is at the next level of the index. Since Johnson and Jones are consecutive entries, all the entries on page 1:200 have values between Johnson (inclusive) and Jones (exclusive).

Figure 16-1. An index node page.

The leaf, or bottom, level of the index is where clustered and nonclustered indexes differ. For both kinds of indexes, the leaf level contains every key value in the table on which the index is built, and those keys are in sorted order. In a clustered index, the leaf level is the data level, so of course every key value is present. This means that the data in a table is sorted in order of the clustered index. In a nonclustered index, the leaf level is separate from the data. In addition to the key values, the index rows contain a bookmark indicating where to find the actual data. If the table has a clustered index, the bookmark is the clustered index key that corresponds to the nonclustered key in the row. (If the clustered key is composite, all parts of the key are included.)

Remember that clustered indexes are guaranteed to be unique in SQL Server 2000; if you don't declare them as unique, SQL Server adds a uniqueifier to every duplicate key to turn the index into a unique composite index. If our index on last name is a nonclustered index and the clustered index on the table is the zip code, a leaf-level index page might look something like Figure 16-2. The number in parentheses after the zip code is the uniqueifier and appears only when there are duplicate zip codes.

Figure 16-2. A leaf-level index page.

Choose the Clustered Index Carefully

Clustered indexes are extremely useful for range queries (for example, WHERE sales_quantity BETWEEN 500 and 1000) and for queries in which the data must be ordered to match the clustering key. Only one clustered index can exist per table, since it defines the physical ordering of the data for that table. Since you can have only one clustered index per table, you should choose it carefully based on the most critical retrieval operations. Because of the clustered index's role in managing space within the table, nearly every table should have one. And if a table has only one index, it should probably be clustered.

If a table is declared with a primary key (which is advisable), by default the primary key columns form the clustered index. Again, this is because almost every table should have a clustered index, and if the table has only one index, it should probably be clustered. But if your table has several indexes, some other index might better serve as the clustered index. This is often true when you do single-row retrieval by primary key. A nonclustered, unique index works nearly as well in this case and still enforces the primary key's uniqueness. So save your clustered index for something that will benefit more from it by adding the keyword NONCLUSTERED when you declare the PRIMARY KEY constraint.

Make Nonclustered Indexes Highly Selective

A query using an index on a large table is often dramatically faster than a query doing a table scan. But this is not always true, and table scans are not all inherently evil. Nonclustered index retrieval means reading B-tree entries to determine the data page that is pointed to and then retrieving the page, going back to the B-tree, retrieving another data page, and so on until many data pages are read over and over. (Subsequent retrievals can be from cache.) With a table scan, the pages are read only once. If the index does not disqualify a large percentage of the rows, it is cheaper to simply scan the data pages, reading every page exactly once.

The query optimizer greatly favors clustered indexes over nonclustered indexes, because in scanning a clustered index the system is already scanning the data pages. Once it is at the leaf of the index, the system has gotten the data as well. So there is no need to read the B-tree, read the data page, and so on. This is why nonclustered indexes must be able to eliminate a large percentage of rows to be useful (that is, highly selective), whereas clustered indexes are useful even with less selectivity.

Indexing on columns used in the WHERE clause of frequent or critical queries is often a big win, but this usually depends on how selective the index is likely to be. For example, if a query has the clause WHERE last_name = 'Stankowski', an index on last_name is likely to be very useful; it can probably eliminate 99.9 percent of the rows from consideration. On the other hand, a nonclustered index will probably not be useful on a clause of WHERE sex = 'M' because it eliminates only about half of the rows from consideration; the repeated steps needed to read the B-tree entries just to read the data require far more I/O operations than simply making one single scan through all the data. So nonclustered indexes are typically not useful on columns that do not have a wide dispersion of values.

Think of selectivity as the percentage of qualifying rows in the table (qualifying rows/total rows). If the ratio of qualifying rows to total rows is low, the index is highly selective and is most useful. If the index is used, it can eliminate most of the rows in the table from consideration and greatly reduce the work that must be performed. If the ratio of qualifying rows to total rows is high, the index has poor selectivity and will not be useful. A nonclustered index is most useful when the ratio is around 5 percent or less—that is, if the index can eliminate 95 percent of the rows from consideration. If the index has less than 5 percent selectivity, it probably will not be used; either a different index will be chosen or the table will be scanned. Recall that each index has a histogram of sampled data values for the index key, which the query optimizer uses to estimate whether the index is selective enough to be useful to the query.

Tailor Indexes to Critical Transactions

Indexes speed data retrieval at the cost of additional work for data modification. To determine a reasonable number of indexes, you must consider the frequency of updates vs. retrievals and the relative importance of the competing types of work. If your system is almost purely a decision-support system (DSS) with little update activity, it makes sense to have as many indexes as will be useful to the queries being issued. A DSS might reasonably have a dozen or more indexes on a single table. If you have a predominantly online transaction processing (OLTP) application, you need relatively few indexes on a table—probably just a couple carefully chosen ones.

Look for opportunities to achieve index coverage in queries, but don't get carried away. An index "covers" the query if it has all the data values needed as part of the index key. For example, if you have a query such as SELECT emp_name, emp_sex from employee WHERE emp_name LIKE 'Sm%' and you have a nonclustered index on emp_name, it might make sense to append the emp_sex column to the index key as well. Then the index will still be useful for the selection, but it will already have the value for emp_sex. The query optimizer won't need to read the data page for the row to get the emp_sex value; the query optimizer is smart enough to simply get the value from the B-tree key. The emp_sex column is probably a char(1), so the column doesn't add greatly to the key length, and this is good.

Every nonclustered index is a covering index if all you are interested in is the key column of the index. For example, if you have a nonclustered index on first name, it covers all these queries:

Select all the first names that begin with K.
Find the first name that occurs most often.
Determine whether the table contains the name Melissa.

In addition, if the table also has a clustered index, every nonclustered index includes the clustering key. So it can also cover any queries that need the clustered key value in addition to the nonclustered key. For example, if our nonclustered index is on the first name and the table has a clustered index on the last name, the following queries can all be satisfied by accessing only leaf pages of the B-tree:

Select Tibor's last name.
Determine whether any duplicate first and last name combinations exist.
Find the most common first name for people with the last name Wong.

You can go too far and add all types of fields to the index. The net effect is that the index becomes a virtual copy of the table, just organized differently. Far fewer index entries fit on a page, I/O increases, cache efficiency is reduced, and much more disk space is required. The covered queries technique can improve performance in some cases, but you should use it with discretion.

A unique index (whether nonclustered or clustered) offers the greatest selectivity (that is, only one row can match), so it is most useful for queries that are intended to return exactly one row. Nonclustered indexes are great for single-row accesses via the PRIMARY KEY or UNIQUE constraint values in the WHERE clause.

Indexes are also important for data modifications, not just for queries. They can speed data retrieval for selecting rows, and they can speed data retrieval needed to find the rows that must be modified. In fact, if no useful index for such operations exists, the only alternative is for SQL Server to scan the table to look for qualifying rows. Update or delete operations on only one row are common; you should do these operations using the primary key (or other UNIQUE constraint index) values to be assured that there is a useful index to that row and no others.

A need to update indexed columns can affect the update strategy chosen. For example, to update a column that is part of the key of the clustered index on a table, you must process the update as a delete followed by an insert rather than as an update-in-place. When you decide which columns to index, especially which columns to make part of the clustered index, consider the effects the index will have on the update method used. (Review the discussion of updates in Chapter 9.)

SQL Server 2000 provides a new function that can give us a workaround if you need to index a large character field. The CHECKSUM function computes a checksum on a row or a column, and its value is always a 4-byte integer. You can create a computed column to be the checksum of your large character field and then build an index on that computed column. The values returned by CHECKSUM are not guaranteed to be absolutely unique, but there will be few duplicates. Since there is the possibility of two character string values having the same value for the checksum, your queries will need to include the full string that you're looking for.

Here's an example. Suppose we're expecting the titles table in the pubs database to grow to over a million titles by the end of the year. We want to be able to do quick lookups on book names, but since the title field is 80 characters wide, we know that an index on title is not ideal. (In fact, the titles table does have an index on title, but since it's not really such a good idea, we'll drop it.) So we'll create a computed column using CHECKSUM and index that:

 DROP INDEX titles.titleind GO ALTER TABLE titles ADD hash_title AS CHECKSUM(title) GO CREATE INDEX hash_index ON titles(hash_title) GO

First, let's try just searching for a particular title:

 SELECT * FROM titles WHERE title = 'Cooking with Computers: Surreptitious Balance Sheets'

If you look at the query plan for this SELECT, you'll see that a clustered index scan is done, which means the whole table must be searched. Instead, let's also query on the checksum value in the hash_title column:

 SELECT * FROM titles WHERE title = 'Cooking with Computers: Surreptitious Balance Sheets' AND hash_title = CHECKSUM('Cooking with Computers: Surreptitious Balance Sheets')

The query plan will now show that SQL Server can do an index seek on the computed column hash_title. We'll look at query plans in detail later in this chapter.

Pay Attention to Column Order

At the risk of stating the obvious, an index can be useful to a query only if the criteria of the query match the columns that are leftmost in the index key. For example, if an index has a composite key of last_name,first_name, that index is useful for a query such as WHERE last_name = 'Smith' or WHERE last_name = 'Smith' AND first_name = 'John'. But it is not useful for a query such as WHERE first_name = 'John'. Think of using the index like a phone book. You use a phone book as an index on last name to find the corresponding phone number. But the standard phone book is useless if you know only a person's first name because the first name might be located on any page.

You should put the most selective columns leftmost in the key of nonclustered indexes. For example, an index on emp_name,emp_sex is useful for a clause such as WHERE emp_name = 'Smith' AND emp_sex = 'M'. But if the index is defined as emp_sex,emp_name, it isn't useful for most retrievals. The leftmost key, emp_sex, cannot rule out enough rows to make the index useful. Be especially aware of this when it comes to indexes that are built to enforce a PRIMARY KEY or UNIQUE constraint defined on multiple columns. The index is built in the order that the columns are defined for the constraint. So you should adjust the order of the columns in the constraint to make the index most useful to queries; doing so will not affect its role in enforcing uniqueness.

Index Columns Used in Joins

Index columns are frequently used to join tables. When you create a PRIMARY KEY or UNIQUE constraint, an index is automatically created for you. But no index is automatically created for the referencing columns in a FOREIGN KEY constraint. Such columns are frequently used to join tables, so they are almost always among the most likely ones on which to create an index. If your primary key and foreign key columns are not naturally compact, consider creating a surrogate key using an identity column (or a similar technique). As with row length for tables, if you can keep your index keys compact, you can fit many more keys on a given page, which results in less physical I/O and better cache efficiency. And if you can join tables based on integer values such as an identity, you avoid having to do relatively expensive character-by-character comparisons. Ideally, columns used to join tables are integer columns—fast and compact.

I talked about density in the statistics discussion in Chapter 15. When determining the density of a SARG, the query optimizer can look up a specific value in the statistics histogram to get an estimate of the number of occurrences. However, when it looks at whether an index is useful for a join operation, there is no specific value that it can look up. The query optimizer needs to know how many rows in one table are likely to match the join columns value from the other table. In other words, if two tables are related in a one-to-many relationship, how many is "many"? We use the term join density to mean the average number of rows in one table that match a row in the table it is being joined to. You can also think of join density as the average number of duplicates for an index key. A column with a unique index has the lowest possible join density (there can be no duplicates) and is therefore extremely selective for the join. If a column being joined has a large number of duplicates, it has a high density and is not very selective for joins.

As you learned in Chapter 15, joins are frequently processed as nested loops. For example, if while joining the orders table with order_items the system starts with the orders table (the outer table) and then for each qualifying order row, the inner table is searched for corresponding rows. Think of the join being processed as, "Given a specific row in the outer table, go find all corresponding rows in the inner table." If you think of joins in this way, you'll realize that it's important to have a useful index on the inner table, which is the one being searched for a specific value. For the most common type of join, an equijoin that looks for equal values in columns of two tables, the query optimizer automatically decides which is the inner table and which is the outer table of a join. The table order that you specify for the join doesn't matter in the equijoins case. However, the order for outer joins must match the semantics of the query, so the resulting order is dependent on the order specified.

Create or Drop Indexes as Needed

If you create indexes but find that they aren't used, you should drop them. Unused indexes slow data modification without helping retrieval. You can determine whether indexes are used by watching the plans produced via the SHOWPLAN options. However, this isn't easy if you're analyzing a large system with many tables and indexes. There might be thousands of queries that can be run and no way to run and analyze the SHOWPLAN output for all of them. An alternative is to use the Index Tuning Wizard to generate a report of current usage patterns. The wizard is designed to determine which new indexes to build, but you can use it simply as a reporting tool to find out what is happening in your current system. We'll look at the wizard in the next section.

Some batch-oriented processes that are query intensive can benefit from certain indexes. Such processes as complex reports or end-of-quarter financial closings often run infrequently. If this is the case, remember that creating and dropping indexes is simple. Consider a strategy of creating certain indexes in advance of your batch processes and then dropping them when those batch processes are done. In this way, the batch processes benefit from the indexes but do not add overhead to your OLTP usage.

The Index Tuning Wizard

You've seen that the query optimizer can come up with a reasonably effective query plan even in the absence of well-planned indexes on your tables. However, this does not mean that a well-tuned database doesn't need good indexes. The query plans that don't rely on indexes frequently consume additional system memory, and other applications might suffer. In addition, having appropriate indexes in place can help solve many blocking problems.

Designing the best possible indexes for the tables in your database is a complex task; it not only requires a thorough knowledge of how SQL Server uses indexes and how the query optimizer makes its decisions, but it requires that you be intimately familiar with how your data will actually be used. The SQL Server 2000 Index Tuning Wizard is a powerful tool for helping you design the best possible indexes.

You can access the Index Tuning Wizard from the Wizards list in SQL Server Enterprise Manager or from the Tools menu in SQL Profiler. You can also access the wizard from the Query menu in SQL Query Analyzer. The wizard tunes a single database at a time, and it bases its recommendations on a workload file that you provide. The workload file can be a file of captured trace events that contains at least the RPC and SQL Batch starting or completed events, as well as the text of the RPCs and batches. It can also be a file of SQL statements. If you use SQL Query Analyzer to invoke the wizard, you also have the option of tuning just the current queries in your SQL Query Analyzer window. If you use SQL Profiler to create the workload file, you can capture all SQL statement submitted by all users over a period of time. The wizard can then look at the data access patterns for all users, for all tables, and make recommendations with everyone in mind.

The book's companion CD includes a whitepaper describing the Index Tuning Wizard (indextuning.doc). You can check the Microsoft web site periodically to see if any new whitepapers have been posted. The best place to look is www.microsoft.com/sql/index.htm#W. The CD includes a set of lab exercises in a file called Analysis Labs.doc that allow new users to get a feel for the graphical tools in SQL Server 2000. This document includes a lab exercise for exploring the Index Tuning Wizard. To go along with these lab exercises, the CD also contains a sample database called PierrotImports, which is available as two files, PierrotImports.mdf and PierrotImports_log.ldf. You can use the procedure sp_attach_db to add the PierrotImports database to your system. An example of using this command is in the file Attach_Pierrot.sql. Finally, the CD includes a sample workload file (PierrotQueries.sql) composed of SQL statements that access the PierrotImports database.

When the wizard gets a workload file, it tunes the first 32,767 parsable queries and produces nine reports. You can choose to make the analysis more exhaustive or less. You can choose to have the wizard keep all existing indexes, or you can have it come up with a totally new set of index recommendations, which might mean dropping some or all of your existing indexes. Finally, you can choose whether or not the Index Tuning Wizard will recommend building indexed views.

After the wizard generates its reports, which you can save to disk, you can implement the suggestions immediately, schedule the implementation for a later time, or save the script files for the creation (and optional dropping) of indexes. Once the scripts are saved, you can run them later or just inspect them to get ideas of possible indexes to create.

You can also use the wizard purely for analyzing your existing design. One of the reports, Index Usage Report (Current Configuration), shows what percentage of the queries in the submitted workload make use of each of your existing indexes. You can use this information to determine whether an index is being used at all and get rid of any unused indexes along with their space and maintenance overhead.

The current version of the wizard cannot completely tune cross-database queries. It inspects the local tables involved in the queries for index recommendations, but it ignores the tables in other databases. Also, if you're using Unicode data, the wizard does not recognize Unicode constants (such as N'mystring') in your queries and cannot optimize for them. Future versions of the wizard will address these issues and probably add new features as well.

NOTE
If you're going to use SQL Profiler to capture a workload file, you should consider tracing events at several periods throughout the day. This can give you a more balanced picture of the kinds of queries that are actually being executed in your database. If you define your trace using SQL Profiler procedures, you can set up a SQL Server Agent job to start and stop the trace at predefined times, appending to the same workload file each time.

You can also run the Index Tuning Wizard from a command prompt. This capability allows you to define the tuning activities you're interested in but schedule the activity for a later time when the system might not be so busy. Any command prompt activity can be a job step for the SQL Server Agent to run at a predetermined schedule, or you can use the operating system's scheduler to run the command. For details, look up the itwiz utility in SQL Server Books Online.