Index Design Guidelines


SQL Server indexes are mostly transparent to end users and T-SQL developers. Indexes are typically not specified in queries unless you use table hints to force the optimizer to use a particular index (which is not advised; optimizer table hints are covered in more detail in Chapter 35). Normally, based on the index key histogram values, the SQL Server cost-based optimizer chooses the index that is least expensive from an I/O standpoint.

Chapter 35 goes into greater detail on how the optimizer estimates I/O and determines the most efficient query plan. In the meantime, this section presents some of the main guidelines to follow to create useful indexes within your environment that the optimizer can use effectively.

Here are some general guidelines to follow when designing your indexes:

  • For composite indexes, try to keep the more selective columns leftmost in the index. The first element in the index should be the most unique (if possible), and index column order in general should be from most to least unique. However, remember that selectivity doesn't help if the first ordered index column is not specified in your SARGs or join clauses. To ensure that the index is used for the largest number of queries, the first ordered column should be the column used most often in your queries.

  • Be sure to index columns used in joins. Joins are processed inefficiently if there is no index on the join column(s). Remember: A PRIMARY KEY constraint automatically creates an index on the column, but a FOREIGN KEY constraint does not. You will want to create indexes on your foreign key columns if your queries commonly join between the primary key and foreign key tables (see the example following this list).

  • Tailor your indexes for your most critical queries and transactions. You cannot index for every possible query that might be run against your tables. However, your applications will perform better if you can identify your critical and most frequently executed queries and design indexes to support them. SQL Profiler, which is covered in Chapter 7, is a useful tool for identifying your most frequently executed queries. SQL Profiler can also help identify slow-running queries that might benefit from a better index design.

  • Avoid indexes on columns that have poor selectivity. The optimizer will likely not use such indexes, and they will simply take up space and add unnecessary overhead during inserts, updates, and deletes. One possible exception is when the index can be used to cover a query. Index covering is discussed in more detail in the "Index Covering" section later in this chapter.

  • Choose your clustered and nonclustered indexes carefully. The next two sections discuss tips and guidelines for choosing between clustered and nonclustered indexes based on the data contained in the columns and the types of queries executed against the columns.
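To make the composite-key and foreign key guidelines concrete, here is a minimal sketch (the tables and index names are illustrative and are not part of the pubs sample database) that indexes a foreign key column explicitly and orders a composite key from most to least selective:

  -- Illustrative tables only (not part of the pubs sample database).
  create table customers
  (
      customer_id int         not null constraint PK_customers primary key,
      last_name   varchar(40) not null
  )
  go

  create table orders
  (
      order_id    int      not null constraint PK_orders primary key,
      customer_id int      not null
          constraint FK_orders_customers references customers (customer_id),
      order_date  datetime not null,
      status      char(1)  not null
  )
  go

  -- The FOREIGN KEY constraint does not create an index, so index the
  -- foreign key column explicitly to support joins back to customers.
  create index NC_orders_customer_id on orders (customer_id)
  go

  -- Composite index with the more selective column (order_date) leftmost;
  -- status has only a few distinct values, so it is placed last.
  create index NC_orders_date_status on orders (order_date, status)
  go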

Clustered Index Indications

Searching for rows via a clustered index is almost always faster than searching for rows via a nonclustered index for two reasons. One reason is that a clustered index contains only pointers to pages rather than pointers to individual data rows; therefore, a clustered index is more compact than a nonclustered index. Because a clustered index is smaller and doesn't require an additional bookmark lookup to find the matching rows, the rows can be found with fewer page reads than with a similarly defined nonclustered index. The second reason is that because the data in a table with a clustered index is physically sorted on the clustered key, searching for duplicate values or for a range of clustered key values is faster; the rows are adjacent to each other, and SQL Server can simply locate the first qualifying row and then search the rows in sequence until the last qualifying row is found. However, because you are allowed to create only one clustered index per table, you must choose judiciously the column or columns on which to define the clustered index.

If you require only a single index on a table, it's typically advantageous to make it a clustered index; the resulting overhead of maintaining clustered indexes during updates, inserts, and deletes can be considerably less than the overhead incurred by nonclustered indexes.

By default, the primary key on a table is defined as a clustered unique index. In most applications, the primary-key column on a table is almost always retrieved in single-row lookups. For single-row lookups, a nonclustered index usually costs you only a few I/Os more than a similar clustered index. Are you or the users really going to notice a difference between three page reads to retrieve a single data row versus four to six page reads to retrieve a single data row? Not at all. However, if you have to perform a range retrieval, such as a lookup on last name, will you notice a difference between scanning 10 percent of the table versus having to find the rows using a full table scan? Most definitely.

With this in mind, you might want to consider creating your primary key as a unique nonclustered index and choosing another candidate for your clustered index. The following are guidelines to consider for other potential candidates for the clustered index:

  • Columns with a number of duplicate values that are searched frequently, for example, WHERE last_name = 'Smith'.

    Because the data is physically sorted, all the duplicate values are kept together. Any query that tries to fetch records against such keys will find all the values using a minimum number of I/Os. SQL Server locates the first row that matches the SARG and then scans the data rows in order until it finds the last row matching the SARG.

  • Columns that are often specified in the ORDER BY clause.

    Because the data is already sorted, SQL Server can avoid having to re-sort the data if the ORDER BY is on the clustered index key and the data is retrieved in clustered key order. Remember: Even for a table scan, the data will be retrieved in clustered key order because the data in the table is in clustered key order. The only exception is if a parallel query operation is used to retrieve the data rows; the results will need to be re-sorted when the resultsets from each parallel thread are merged. (For more information on parallel query strategies, see Chapter 35.)

  • Columns that are often searched for within a range of values, for example, WHERE price between $10 and $20.

    The clustered index can be used to locate the first qualifying row in the range of values. Because the rows in the table are in sorted order, SQL Server can simply scan the data pages in order until it finds the last qualifying row within the range. When the resultset within the range of values is large, a clustered index scan is significantly more efficient in terms of total logical I/Os performed than repeated bookmark lookups via a nonclustered index.

  • Columns, other than the primary key, that are frequently used in join clauses.

    Clustered indexes tend to be smaller than nonclustered indexes; the number of page I/Os required per lookup will generally be less than for a nonclustered index (see Chapter 33 for a detailed discussion of index structures and sizes). This can be a significant difference when joining many records. An extra page read or two might not seem like much for a single-row retrieval, but add those additional page reads to 100,000 join iterations, and you're looking at a total of 100,000 to 200,000 additional page reads.

When you consider columns for a clustered index, you might want to try to keep your clustered indexes on relatively static columns to minimize the re-sorting of data rows when an indexed column is updated. Any time a clustered index key value changes, all nonclustered indexes using the clustered key as the bookmark to that row also need to be updated.

Try to avoid creating clustered indexes on sequential key fields that are inserted monotonically, such as on an identity column. This can create a "hot spot" at the end of the table that results in possible locking contention and deadlocks at the end of the table and the index. Additionally, the clustered index will not be reusing available space on preceding data pages because all new rows sort to the end of the table. This situation results in wasted space and your table growing larger than anticipated. The general recommendation is that you try to cluster on a data value that's somewhat randomly distributed throughout your table. Try to choose a clustered key that spreads the insert and update activity across the table but also benefits your other queries as well. Some candidates for clustered index keys to randomize your data include the following:

  • Date of birth

  • Last name, first name

  • ZIP Code

  • A random hash key (usually used only when no other actual data columns are good clustered index candidates)

Spreading your data throughout the table helps to minimize page contention and provides more efficient space utilization. If the sequential key is your primary key, you can still use a unique, nonclustered index to provide an access path via the index and maintain the uniqueness of the primary key.
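For example, the following is a minimal sketch of this layout (the table name authors_alt is illustrative; the pubs sample database actually defines au_id as a clustered primary key on the authors table). The primary key is enforced with a unique nonclustered index, and the clustered index is placed on the name columns to support range retrievals and ORDER BY:

  create table authors_alt
  (
      au_id    varchar(11) not null
          constraint PK_authors_alt primary key nonclustered,
      au_lname varchar(40) not null,
      au_fname varchar(20) not null,
      state    char(2)     null
  )
  go

  -- Cluster on the name columns; inserts are spread across the table
  -- rather than piling up at the end as they would with an identity key.
  create clustered index CI_authors_alt_name
      on authors_alt (au_lname, au_fname)
  go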

Because you can physically sort the data in a table in only one way, you can have only one clustered index. Any other columns you want to index have to be defined with nonclustered indexes.

Nonclustered Index Indications

SQL Server 2000 allows you to create a maximum of 249 nonclustered indexes on a table. Until tables become extremely large, the actual space taken by a nonclustered index is a minor expense compared to the increased access performance. Always keep in mind, however, that as you add more indexes to the system, database modification statements get slower due to the index maintenance overhead.

Also, when defining nonclustered indexes, you typically want to define indexes on columns that are more selective (that is, columns with low density values) so that they can be used effectively by the optimizer. A high number of duplicate values in a nonclustered index can often make it more expensive (in terms of I/O) to process the query using the nonclustered index than a table scan. Let's look at a hypothetical example:

  select title
  from titles
  where price between $10 and $20

Assume that you have 1,000,000 rows within the range; those 1,000,000 rows could be randomly scattered throughout the table. Although the index leaf level has all the index rows in sorted order, reading all data rows one at a time would require a separate bookmark lookup for each row in the worst-case scenario.

Thus, the worst-case I/O estimate for range retrievals using a nonclustered index is as follows:

number of levels in the nonclustered index

+ number of index pages scanned to find all matching rows

+ number of matching rows x the number of pages per bookmark lookup

If you have no clustered index on the table, the bookmark is simply a page and row pointer and requires one data page read to find the matching data row. If 1,000,000 rows are in the range, the worst-case cost estimate to search via the nonclustered index with no clustered index on the table would be as follows:

The number of index page reads to find all the bookmarks

+ 1,000,000 matching rows x 1 data page read

= 1,000,000+ I/Os

If you have a clustered index on the table, the bookmark is a clustered index key for the data row. Using the bookmark to find the matching row requires searching the clustered index tree to locate the data row. Assuming that the clustered index has two non-leaf levels, it would cost three pages to find each qualifying row on a data page. If the range has 1,000,000 rows, the worst-case cost estimate to search via the nonclustered index with a clustered index on the table would be as follows:

The number of index page reads to find all the bookmarks

+ 1,000,000 matching rows x 3 pages per bookmark lookup

= 3,000,000+ I/Os

Contrast each of these scenarios with the cost of a table scan. If the entire table takes up 50,000 pages, a full table scan would cost only 50,000 I/Os. Therefore, in this example, a table scan would actually be more efficient than using the nonclustered index.
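To compare these estimates against your own tables, a quick sanity check is to look at the actual size of the table and its indexes. sp_spaceused reports both; the data value (in KB) divided by the 8KB page size approximates the number of pages a full table scan would read:

  -- Report space used by the table and its indexes; data (KB) / 8
  -- approximates the page count a table scan would have to read.
  exec sp_spaceused 'titles'
  go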

The following guidelines help you identify potential candidates for nonclustered indexes for your environment:

  • Columns referenced in SARGs or join clauses that have a relatively high selectivity (the density value is low).

  • Columns referenced in both the WHERE clause and the ORDER BY clause.

    When the data rows are retrieved using a nonclustered index, they are retrieved in nonclustered index key order. If the resultset is to be ordered by the nonclustered index key(s) as well, SQL Server can avoid having to re-sort the resultset, resulting in a more efficient query. The following query is an example where SQL Server can avoid the extra step of sorting the resultset if a nonclustered index is on state and the index is used to retrieve the matching rows:

  select *
  from authors
  where state like 'c%'
  order by state
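A minimal sketch of an index that would support this query (the index name is illustrative):

  -- A nonclustered index on state both locates the matching rows and
  -- returns them already ordered by state, so no separate sort is needed.
  create nonclustered index NC_authors_state
      on authors (state)
  go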

In general, nonclustered indexes are useful for single-row lookups, joins, queries on columns that are highly selective, or queries with small range retrievals. Also, when considering your nonclustered index design, don't overlook the benefits of index covering, as described in the following section.

Index Covering

Index covering is a situation where all the information required by the query in the SELECT and WHERE clauses can be found entirely within the nonclustered index itself. Because the nonclustered index contains a leaf row corresponding to every data row in the table, SQL Server can satisfy the query from the leaf rows of the nonclustered index. This results in faster retrieval of data because all the information can come directly from the index page, and SQL Server avoids lookups of the data pages.

Because the leaf pages in a nonclustered index are linked together, the leaf level of the index can be scanned just like the data pages in a table. And because the leaf index rows are typically much smaller than the data rows, a nonclustered index that covers a query will be faster than a clustered index on the same columns; fewer pages need to be read to retrieve the same information.

In the following example, a nonclustered index on the au_lname and au_fname columns of the authors table would cover the query because the result columns, as well as the SARGs, can all be derived from the index itself:

  select au_lname, au_fname
  from authors
  where au_lname like 'M%'
  go
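A sketch of an index that would cover this query (the index name is illustrative):

  -- Both the SARG (au_lname) and the select list (au_lname, au_fname) can be
  -- satisfied from the index keys alone, so the data pages are never touched.
  create nonclustered index NC_authors_name
      on authors (au_lname, au_fname)
  go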

Many other queries that use an aggregate function (such as MIN, MAX, AVG, SUM, and COUNT) or simply check for the existence of rows matching the search criteria also benefit from index covering. The following queries are examples of aggregate queries that can take advantage of index covering:

  select count(au_lname) from authors where au_lname like 'm%'

  select count(*) from authors where au_lname like 'm%'

  select count(*) from authors

You might be wondering how the last query, which doesn't even specify a SARG, can still use an index. SQL Server knows that, by its nature, a nonclustered index contains a row for every data row in the table; it can simply count all the rows in any of the nonclustered indexes instead of scanning the whole table. For the last query, SQL Server chooses the smallest nonclustered index; that is, the one with the fewest leaf pages.

Adding columns to nonclustered indexes to get index covering to occur is a common method of improving query response time. Consider the following query:

  select royalty
  from titles
  where price between $10 and $20

If you create an index on only the price column, SQL Server could find the rows in the index where price is between $10 and $20, but it would have to access the data rows to retrieve royalty. With 100 rows in the range, the worst-case I/O cost to retrieve the data rows would be as follows:

The number of index levels

+ The number of index pages to find the matching rows

+ 100 x the number of pages per bookmark lookup

If the royalty column were added to the index on the price column, the index could be scanned to retrieve the results instead of having to perform the bookmark lookups against the table, resulting in faster query response. The I/O cost using index covering would be only

The number of index levels

+ The number of index pages to find the matching rows
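A sketch of the widened index (the name is illustrative; SQL Server 2000 has no separate mechanism for non-key included columns, so royalty is simply appended as a trailing key column):

  -- price remains the leading, searchable column; royalty is carried in the
  -- index only so that the query can be covered without bookmark lookups.
  create nonclustered index NC_titles_price_royalty
      on titles (price, royalty)
  go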

CAUTION

When considering padding your indexes to take advantage of index covering, beware of making the index too wide. As index row width approaches data row width, the benefits of covering are lost as the number of pages in the leaf level increases. As the number of leaf-level index pages approaches the number of pages in the table, the number of index levels also increases, and index scan time begins to approach table scan time.

Also, if you add columns that are updated frequently to an index, any changes to the columns in the data rows cascade into the indexes as well. This increases the index maintenance overhead, which can adversely impact update performance.

As discussed in Chapter 33, when you have a clustered index defined on a table, the clustered key is carried into all the nonclustered indexes to be used as the bookmark to locate the actual data row. The clustered key is the actual columns that make up the clustered index and their data values. This "feature" can sometimes result in index covering when it is not expected.

For example, assume that the authors table has a clustered index on au_lname and au_fname, and a nonclustered primary key defined on au_id. Each row in the nonclustered index on au_id would contain the clustered key values for au_lname and au_fname for its corresponding data row. Because of this, the following query would actually be covered by the nonclustered index on au_id:

  select au_lname, au_fname
  from authors
  where au_id like '123%'
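One way to confirm this behavior is to examine the query plan. Assuming indexes like those just described, the plan shows only the access to the nonclustered index on au_id, with no Bookmark Lookup operator:

  -- Display the estimated plan as text; a covered query shows only the
  -- nonclustered index access, with no Bookmark Lookup step.
  set showplan_text on
  go
  select au_lname, au_fname
  from authors
  where au_id like '123%'
  go
  set showplan_text off
  go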

Composite Indexes Versus Multiple Indexes

As your index key gets wider, the selectivity of the key generally becomes higher as well. It might appear as if creating wide indexes should result in better performance. This is not necessarily true. The reason is that the wider the key, the fewer rows SQL Server stores on the index pages, requiring more pages at each level, resulting in a higher number of levels in the index B-tree. To get to specific rows, SQL Server must perform more I/Os.

To get better performance from queries, instead of creating a few wide indexes, consider creating multiple narrower indexes. The advantage here is that with smaller keys, the query optimizer can quickly scan through multiple indexes to create the most efficient access plan. Unlike versions of SQL Server prior to 7.0, SQL Server now has the option of performing multiple index lookups within a single query and merging the resultsets together to generate an intersection of the indexes. Also, with more indexes, the optimizer can choose from a wider variety of query plan alternatives.

If you are considering creating a wide key, check the distribution of values for each member of the composite key individually. If the selectivity on the individual columns is high, you might want to break up the index into multiple indexes. If the selectivity of individual columns is low but is high for combined columns, it makes sense to have wider keys on the table. To get to the right combination, populate your table with real-world data, experiment with creating multiple indexes, and check the distribution of values for each column. Based on the histogram steps and index density, you can make the decisions for an index design that works best for your environment.
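DBCC SHOW_STATISTICS is one way to examine these densities and histogram steps on a populated table (the index name here is illustrative):

  -- The statistics header includes overall density, the density vector shows
  -- the density for each key prefix, and the histogram steps show the
  -- distribution of values in the leading column of the index.
  dbcc show_statistics ('titles', NC_titles_price_royalty)
  go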


