Overview of Indexing | MCSE Training Guide (70-229). Designing and Implementing Databases with MicrosoftR SQL ServerT 2000 Enterprise Edition

Implement indexing strategies.

One of the most difficult database tasks to be performed is choosing a working indexing strategy to accompany the system's business's needs. This chapter gives you the detailed insight you need to understand indexes and how to create them for high-power applications and databases. This chapter will acquaint you with the types of indexes available in SQL Server 2000, and how to create and alter them. You will discover a strategy for selecting appropriate columns to include in an index. You will also look at tools that you can use to analyze and improve on existing index strategies.

Simply put, indexes are database objects that provide efficient access to data in rows in tables. Indexes may also function to provide uniqueness throughout certain rows in your table. Through SQL Server's built-in query optimizer, SQL Server is able to select from a table's many indexes and choose a few of them that will execute a query the fastest .

Indexes can significantly boost performance in your database. In fact, after database design, the next thing you'll want to consider for improving the performance of database searches should be indexes. Fortunately, indexes are not hard to implement.

To understand indexing and apply appropriate indexes to any project, it is necessary to have an understanding of the physical storage constructs SQL Server uses. Data and index storage concepts are discussed in the following section.

Indexing and Data Storage

You can compare indexing and data storage in SQL server to the same material in a book. Pages, data sequencing, topic pointers, and additional descriptive data can be found in SQL Server storage, and you can understand them if you understand the principles at work in a book.

In both SQL Server and in a book, the physical data and indexes are stored on pages. The pages of the book store data in the form of topics and descriptions. In SQL Server, pages are also the fundamental unit of data storage. Data pages contain all the data in data rows except text , ntext , and image data, which is stored in separate pages. Data rows are placed on the page starting after a page header.

There are many types of pages in SQL Server used to store data in a variety of ways. The list that follows shows the eight types of pages in the data files of a SQL Server 2000 database:

Data. Stores data rows with all data except text , ntext , and image data.
Index. Stores index entries.
Text/Image. Stores data for text , ntext , and image data types.
Global Allocation Map. Information about allocated extents.
Page Free Space. Information about free space available on pages.
Index Allocation Map. Information about extents used by a table or index.
Bulk Changed Map. Information about extents modified by bulk operations since the last BACKUP LOG statement.
Differential Changed Map. Information about extents that have changed since the last BACKUP DATABASE statement.

Each page used to store data in SQL Server contains a number of separate elements. Data, freespace, page header, and row offsets all make up a portion of a page, as illustrated in Figure 10.1.

Figure 10.1. Layout of a data page.

graphics/10fig01.gif

NOTE

Extent Storage Extents are the basic unit in which space is allocated to tables and indexes. An extent is 8 contiguous pages, or 64KB. This means SQL Server 2000 databases have 16 extents per megabyte.

Pages represent the elementary structure for storing all SQL Server object information and data. Pages also represent the basis for all index and data interactions. In the table of contents and topical index at the back of a book, you have numerical pointers to the pages where the data is actually located. SQL Server indexes serve a similar purpose, pointing to the location of the data on specific data pages.

Indexing Architecture

Indexes in databases are similar to indexes in books. In a book, an index enables you to find information quickly without reading the entire book. In a database, an index enables the database program to find data in a table without scanning all the data in the entire table. An index in a book is a list of words with the page numbers that contain each word. An index in a database is a list of values in a table with the storage locations of rows in the table.

Indexes act as very flexible mechanisms for providing order and structure to a table. An index can be created on either a single column or a combination of columns in a table. Indexes can provide the physical order of data storage or can simply allow for alternate methods of sequencing data. Indexes also can enable faster searching of data to improve user response time and general system performance.

SQL Server supports indexes defined on any column in a table, including computed columns. If a table is created with no clustered index, the data rows are not stored in any particular order. This structure is called a heap . If order is supplied by a clustered index, then heap storage is not used. In this case, the data is stored in a given sequence based on the clustered index.

SQL Server supports indexing of two basic types, although there are many variations of those types. The two types of SQL Server indexes are clustered and nonclustered indexes. Clustered indexes are implemented so that the logical order of the key values determines the physical order of the corresponding rows in a table. Nonclustered indexes are indexes in which the logical order of the index is different than the physical, stored order of the rows on disk. The following sections examine them both.

Clustered Indexes

As seen in the previous sections, a clustered index is a type of index in which the logical order of the key values determines the physical order of the data stored on disk. Because a clustered index shows how the physical data of a table is stored, there can be only one clustered index per table.

The selection of appropriate column(s) on which to base a clustered index is particularly important. Column(s) should be chosen where range searches are frequently performed or based upon the expected order of the majority of reports printed from the data. A range search occurs within a WHERE conditional operation, as in selecting the range of all authors who have an advance greater than $500 but less than $4000. With this type of range search, the index, if present, first locates the smallest value (500), and then it locates the other values alongside that value until the last value is reached (4000).

IN THE FIELD: PRIMARY KEY CONSIDERATION

The clustered index is defined by default for the Primary Key when you define a table. However, you may often want to have a column other than the Primary Key be the basis for the physical order of the data. To accommodate this design, you may want to create a nonclustered Primary Key. This is beneficial where surrogate keys are used, because reports are rarely going to be placed in order of the surrogate.

Clustered indexes also save time in searches for a specific row when the values on a column are unique. For example, if you have an SSN column on an Employees table and you place a clustered index on this column, you should notice an increase in data retrieval speed when performing searches on SSN. Naturally, the index on SSN does no good when searching for last name = "Doe" .

Candidates for clustered indexes have the following characteristics:

A column queried using range operators such as BETWEEN , < , or > .
Columns that are grouped using the GROUP BY clause or that involve joins or that represent the ORDER of displayed or printed output.
Queries and reporting that access sequential data.
Operations that result in large resultsets.
Columns that don't have incremented values; for example, an ID column has incremented values.
Columns with many duplicate values.
Columns that don't have incremented values; for example, an ID column has incremented values.
On the PK when there are many inserts causing all records to be inserted on the same page, which can remain in memory.

In SQL Server 2000, clustered indexes are implemented as b-trees, as shown in Figure 10.2. Each index page holds a header that is followed by index rows. Each of these rows contains a pointer value to another sub-page or a row of data. The lowest level of implementation in clustered indexes is known as the leaf page or node and contains data rows and pages. The first level of implementation is known as the root node . Anything between the root node and leaf node is referred to as intermediate-level pages .

Figure 10.2. Clustered indexes are implemented in a b-tree structure.

graphics/10fig02.gif

NOTE

Clustering versus Heap The only time the data rows in a table are stored in sorted order is when the table contains a clustered index. If a table has no clustered index, its data rows are stored in a heap.

Because you can create only one clustered index, the majority of indexes created are nonclustered indexes. A nonclustered index can operate over an existing clustered index or over a data heap.

Examining Nonclustered Indexes

Nonclustered indexes are indexes in which the logical order of the key values in the index is different than the physical order of the rows in the indexed table. Nonclustered indexes are also implemented as b-tree structures, as shown in Figure 10.3. There are two major differences that distinguish them from clustered indexes. First, the leaf node of the nonclustered index does not contain the concrete data. It contains index rows having a nonclustered key value that point to a data row. Second, as already noted, nonclustered indexes do not physically rearrange the data.

Figure 10.3. Nonclustered indexes are also structured in b-tree format.

graphics/10fig03.gif

When you query data on a column with a nonclustered index, SQL Server 2000 first tries to find a record containing your table object ID and a value in the indid column from the sysindexes table that is >= 1 and < 255, because a 0 implies there is no index at all, and 255 implies that the table has Text/Image pages. After it finds the record, SQL Server examines the root column to find the root page of the index. After SQL Server finds the root page, it can begin a search for your data.

The SQL Server 2000 optimizer chooses an index to use for the search. The Keys column in sysindexes lists what columns are involved in each index. If the optimizer knows this and knows what columns you are searching on, it can use index statistics to determine which index to use. In the simplest case, if you are searching on only one column, A, and only one index exists on column A, then it uses that index. When searching on multiple columns, and many or all of these columns are indexed, it uses statistical methods to determine the index.

It's important to understand the following guidelines before you create nonclustered indexes:

Nonclustered indexes are good when queries don't return large resultsets.
Columns that are used in a SELECT list or the WHERE clause are good candidates for nonclustered indexes.
A nonclustered index is the default index in SQL Server.
There can be up to 249 nonclustered indexes on a single table.
Nonclustered indexes generally take more overhead than clustered indexes, so you should create them only when it is absolutely necessary.
Nonclustered indexes make a good choice for use as a covering index.

NOTE

Index Default Nonclustered index is the default index created if no particular type is specified.

SQL Server searches through the index to find a particular value and then follows that pointer to the row containing that value. An index implemented in a SQL Server database is an internal table structure that holds a list of ordered values from one or more columns of a table. Column selection should be made carefully .

What to Index?

So, which columns do you really need to index? Column selectivity is a major step in the process of indexing. In general, columns that are frequently accessed by the WHERE or the ORDER BY clauses should be considered . Indexes should never be added to a column unless you know they actually will help.

When you build indexes, try to narrow indexes down to the minimum number of columns needed. Multicolumn indexes act negatively on performance. Columns with unique values serving as Primary Keys are also good candidates. The more unique rows there are in a column, the more ideal it is for an index. If a column has 50 rows, 25 of which have distinct unique values in one column, then you have a column that has 40 distinct values. The former serves as a better candidate because there are more unique values based on the number of rows.

EXAM TIP

Covering Indexes Be careful on the exam to pay close attention to WHERE and ORDER clause examples that don't have matching or covering indexes. Increased performance and decreased response time may be achieved by adding an index in these cases.

The challenge for the database designer is to build a physical data model that provides efficient data access. This can be done by minimizing I/O processing time. The following types of columns are good ones to index:

A column that acts as the table's Primary or Foreign Key.
Columns that are regularly sorted by the ORDER BY clause.
Columns that are filtered on an exact condition using the WHERE clause. For instance, WHERE state= 'Ca' .
Columns that are queried on joins.
Columns that hold integer values rather than character values.
Searches for rows with search key values in a range of values. For example, WHERE Royalty BETWEEN 1000 and 4000 .
Queries that use the like clause may only benefit if they start with character data. Examine the following: WHERE au_fname LIKE 'sm%' .

Indexes may also be chosen by a factor known as index selectivity . The selectivity is the percentage of rows accessed by an UPDATE , SELECT , or DELETE statement. A query has high selectivity when it is filtered on exact criteria, where all values are unique. For example, a user may execute a query that selects the emp_id of an employee from the Employees table where emp_id equals 112.

WARNING

Narrowing Indexes Keep your indexes as narrow (indexing using very few columns in the search key) as possible, especially when creating composite indexes. This reduces the size of the index and minimizes the amount of reads needed to read the index, therefore boosting performance. This is also why integers are generally better to index than characters , because integers are generally smaller than character fields. The real goal is to keep the byte size of index rows as small as possible.

A query characterized by low selectivity is one in which the searched column has few unique values, such as a search against the contract column (having values of 0 and 1) in the authors table. Indexes placed on columns having high selectivity can make searches significantly more efficient. The classic scenario of a low selectivity column is gender. You should never build an index on gender because there are only two possibilities.

To find the selectivity of a query, use the following formula:

[View full width]

  [View full width] 
 Selectivity ratio = [100 * (Total number of distinct index rows)] / (Total number of rows  in the table)]

Formulas may help in making a selection in very large database scenarios. The true test of any index strategy occurs when queries are processed during day-to-day operations. Strong guidelines also exist on things that should not be indexed.

What Not to Index?

You really can't and shouldn't index all the columns in your table. Doing so would significantly drop performance on inserts and deletes, even though most queries would run fast. When determining whether to index a small table, look to see whether it requires more page reads to scan the index than there are pages in the table. In this case an index will hurt performance, not help it. Therefore, a table with less than three pages is not helped by any index. Learn to use the SQL Server Query Analyzer tool as a guide for whether an index is useful or not. Recognize table scans ; the process of reading all records from a table in sequence may take fewer cycles than accessing an index first ”particularly on small tables.

Here are some conditions that would indicate you should not index a column:

If the index is never used by the query optimizer.
If the column values exhibit low selectivity, often greater than 95% for nonclustered indexes.
If the column(s) to be indexed are very wide.
If the table is rarely queried.
Do not index columns that are not used in where clauses, aggregated, or used in sorting.

Indexes provide many trade-offs. Although queries may show a performance improvement, INSERT , UPDATE , and DELETE operations could see a decline in performance. You may not know the power of indexes until you perform large searches on tables having tens of thousands of rows. Implementing an indexing strategy would not be proper for a small database with a few tables containing no more than 50 rows. Tables are benefited by indexing for the following reasons:

As many as 249 nonclustered indexes can be created on a single table.
Indexes help increase efficiency on complex searches on large tables.
Indexes are easy to implement.
Sixteen columns can comprise the same composite index, as long as the total key length is less than 900 bytes.
Indexes may be used to enforce uniqueness throughout rows in tables.

NOTE

What's New in Indexes New to SQL Server 2000 is the power to index on computed columns and the capability to create indexes on views. You can specify whether indexes are created in ascending or descending order. Also, the CREATE INDEX statement can now use the Tempdb database as a work area for the sorts required to create an index. This results in improved disk read and write patterns for the index creation step.

Nonclustered Covering Indexes

A covering index is a nonclustered index that is built upon all the columns needed to satisfy a SQL query, both in the selection criteria and the WHERE clause. Covering indexes save a huge amount of I/O and build a lot of efficiency in a query. For instance, if you query SQL Server with the following statement:

 SELECT au_fname, au_lname FROM authors WHERE state= 'ca'

You can run the following to create in index:

 CREATE NONCLUSTERED INDEX MyIndex ON Authors(state, au_fname, au_lname)

MyIndex would be considered a covering index because a composite index is built on all columns specified in the SELECT statement and WHERE clause.

Covering indexes significantly boost performance because all the data for the query is contained within the index; only the index pages, not the data pages, of the table must be referenced to retrieve the data.

If your query plan shows a bookmark lookup, that is always a sign that a covering index should be considered. Bookmark lookups indicate that the query plan has located the rows needed in an index, but it then has to jump to the datapages to get information from additional columns outside the index. Adding the additional rows to the index eliminates the bookmark lookups, which are often expensive operations. Of course, in some cases the additional columns make the index so inefficient you loose out on the deal, but nonetheless bookmark lookups always indicate a location where a covering index should be considered.

Covering indexes don't work for clustered indexes because the leaf pages are the datapages themselves , and all columns are already there.

An indexing strategy will undergo many changes as the life cycle of the database system continues. Close monitoring of performance and application activity, as well as other significant database events, will help you decide on appropriate changes to be made in the future.

REVIEW BREAK: Indexing Strategies

An indexing strategy is an integral part of a database system implementation. Appropriate design increases query performance and reduces user response times. Inappropriate design unnecessarily slows down data additions, modifications, and deletions. A balance between the two that meets business needs is essential.

One clustered index per table can be used and, if implemented, it determines the physical order of the data. Nonclustered indexes act like those in the back of a book ”pointing out the physical location of the data. Nonclustered covering indexes can be created in cases where the exact query content is known.

Keys, ranges, and unique values are strong selections for index candidates. Seldom-used data, binary data, and repeating values are poor index candidates. After index candidates have been selected, monitor application usage. Adjust the indexing strategy on a regular basis to provide reliably high performance.