Indexing | The Guru[ap]s Guide to SQL Server[tm] Stored Procedures, XML, and HTML

for RuBoard

There are few more beneficial things you can do to speed up query performance than to construct usable, efficient indexes. The name of the game with large data banks is I/O. You want to avoid as much of it as you can. Caching helps. Processing power helps. Fast hard drives help. But nothing affects query performance as fundamentally or as profoundly as indexing.

Without a useful index, SQL Server has little choice but to scan the entire table or tables to find the data you need. If you're joining two or more tables, SQL Server may have to scan some of them multiple times to find all the data needed to satisfy the query. Indexes can dramatically speed up the process of finding data as well as the process of joining tables together.

Storage

The sysindexes system table stores system-level information about SQL Server indexes. Every index has a row in sysindexes. Each index is identified by its indid column, a 1-based integerthe clustered index is always indid 1. If a table has no clustered index, sysindexes will contain a row for the table itself with an indid value of 0.

The Index Allocation Map

SQL Server tracks the extents that belong to a table or index using IAM (Index Allocation Map) pages. A heap or index will have at least one IAM for each file on which has allocated extents. An IAM is a bitmap that maps extents to objects. Each bit indicates whether the corresponding extent belongs to the object that owns the IAM. Each IAM bitmap covers a range of 512,000 pages (8,000 pages * 8 bits/byte * 8 pages/extent = 512,000). The first IAM page for an index is stored in sysindexes' FirstIAM column. IAM pages are allocated randomly in a database file and linked together in a chain. Even though the IAM permits SQL Server to efficiently prefetch a table's extents, individual rows must still be examined. The IAM just serves as an access method to the pages themselves .

Index Types

SQL Server supports two types of indexes: clustered and nonclustered. Both types have a number of features in common. Both consist of pages stored in balanced (B)-trees. The node levels of each type contain pointers to pages at the next level, whereas the leaf level contains the key values.

B-Trees

SQL Server indexes are stored physically as B-trees. B-trees support the notion of searching through data using a binary search-type algorithm. B-tree indexes store keys with similar values close together, with the tree itself being continually rebalanced to ensure that a given value can be reached with a minimum of page traversal. Because B-trees are balanced, the cost of finding a row is fairly constant, regardless of which row it is.

The first node in a B-tree index is the root node. A pointer to each index's root node is stored in sysindexes' root column. When searching for data using an index, SQL Server begins at the root node, then traverses any intermediate levels that may exist, finally either finding or not finding the data in the bottom-level leaf nodes of the index. The number of intermediate levels will vary based on the size of the table, the size of the index key, and the number of columns in the key. Obviously the more data there is or the larger each key is, the more pages you need.

Index pages above the leaf level are known as node pages. Each row in a node page contains a key or keys and a pointer to a page at the next level whose first key row matches it. This is the general structure of a B-tree. SQL Server navigates these linkages until it locates the data for which it's searching or reaches the end of the linkage in a leaf-level node. The leaf level of a B-tree contains key values, and, in the case of nonclustered indexes, bookmarks to the underlying clustered index or heap. These key values are stored sequentially and can be sorted in either ascending or descending order on SQL Server 2000 and later.

Unlike nonclustered indexes, the leaf node of a clustered index actually stores the data itself. There is no bookmark, nor is there a need for one. When a clustered index is present, the data itself lives in the leaf level of the index.

The data pages in a table are stored in a page chain, a doubly linked list of pages. When a clustered index is present, the order of the rows on each page and the order of the pages within the chain are determined by the index key. Given that the clustered index key causes the data to be sorted, it's important to choose it wisely. The key should be selected with several considerations in mind, among them:

The key should be as small as possible because it will serve as the bookmark in every nonclustered index
The key should be chosen such that it aligns well with common ORDER BY and GROUP BY queries
It should match reasonably well with common range queries (queries in which a range of rows is requested based on the values in a column or columns)

Beginning with SQL Server 7.0, all clustered indexes have unique keys. If a clustered index is created without the UNIQUE keyword, SQL Server forces the index to be unique by appending a 4-byte value called a uniqueifier to key values as necessary to differentiate identical key values from one another.

Leaf-level pages in a nonclustered index contain index keys and bookmarks to the underlying clustered index or table. A bookmark can take one of two forms. When a clustered index exists on the table, the bookmark is the clustered index's key. If the clustered index and the nonclustered index share a common key column, it's stored just once. When a clustered index isn't present, the bookmark consists of a RID (a row identifier) made up of the file number, the page number, and the slot number of the row referenced by the nonclustered key value.

The fact that a heap (a table without a clustered index) forces nonclustered indexes to reference it using physical location information is a good enough reason alone to create a clustered index on every table you build. Without it, changes to the table that cause page splits will have a ripple effect on the table's nonclustered indexes because the physical location of the rows they reference will change, perhaps quite often. This was, in fact, one of the major disadvantages of SQL Server indexing prior to version 7.0: Nonclustered indexes always stored physical row locator information rather than the clustered key value, and were thus susceptible to physical row location changes in the underlying table.

Nonclustered indexes are best at singleton selectsqueries that return a single row. Once the nonclustered B-tree is navigated, the actual data can be accessed with just one page I/Othe read of the page from the underlying table.

Covering Indexes

A nonclustered index is said to "cover" a query when it contains all the columns requested by the query. This allows it to skip the bookmark lookup step and simply return the data the query seeks from its own B-tree. When a clustered index is present, a query can be covered using a combination of nonclustered and clustered key columns because the clustered key is the nonclustered index's bookmark. That is, if the nonclustered index is built on the LastName and FirstName columns, and the clustered index key is built on CustomerID, a query that requests the CustomerID and LastName columns can be covered by the nonclustered index. A covering nonclustered index is the next best thing to having multiple clustered indexes on the same table.

Performance Issues

Generally speaking, keep your index keys as narrow as possible. Wider keys cause more I/O and permit fewer key rows to fit on each B-tree page. This results in the index requiring a larger number of pages than it otherwise would, and causes it to take up more disk space. In practice, you'll probably find yourself tailoring your indexing strategy to meet specific business requirements. For example, if you have a query that takes an extremely long time to return because it needs an index with key columns none of your current indexes have, you may indeed want to widen an existing index or create a new one.

Naturally, there's a tradeoff with adding additional indexes or index columnsnamely, update performance. Because the indexes on a table have to be maintained and updated as you add or change data, each new index you add brings with it a certain amount of overhead. The more indexes you add, the slower the updates against the underlying tables become, so it's important to keep your indexes as compact and narrow as possible, while still meeting the business needs your system was designed to address.

Index Intersection

Prior to version 7.0, the SQL Server query optimizer would use just one index per table to resolve a query. SQL Server 7.0 and later can use multiple indexes per table, intersecting their sets of bookmarks before incurring the expense of retrieving data from the underlying table. This has some implications on index design and key selection, as I'll discuss in a moment.

Index Fragmentation

You can control the amount of fragmentation in an index through its fillfactor setting and through regular defragmenting operations. An index's fillfactor affects performance in several ways. First, creating an index with a relatively low fillfactor helps avoid page splits during inserts . Obviously, with pages only partially full, the potential for needing to split one of them to insert new rows is lower than it would be with completely full pages. Second, a high fillfactor can help compact pages so that less I/O is required to service a query. This is a common technique with data warehouses. Because SQL Server performs I/O operations at the extent level, and extents are themselves collections of pages, retrieving pages that are only partially full wastes I/O bandwidth.

An index's fillfactor setting only affects the leaf-level pages in the index. SQL Server normally reserves enough empty space on intermediate index pages to store at least one row of the index's maximum size. If you want your fillfactor specification applied to intermediate as well as leaf-level pages, supply the PAD_INDEX option of the CREATE TABLE statement. PAD_INDEX instructs SQL Server to apply the fillfactor to the intermediate-level pages of the index. If your fillfactor setting is so high that there isn't room on the intermediate pages for even a single row (e.g., a fillfactor of 100%), SQL Server will override the percentage so that at least one row fits. If your fillfactor setting is so low that the intermediate pages cannot store at least two rows, SQL Server will override the fillfactor percentage on the intermediate pages so that at least two rows fit on each page.

Understand that an index's fillfactor setting isn't maintained over time. It's applied when the index is first created, but is not enforced afterward. DBCC SHOWCONTIG is the tool of choice for determining how full the pages in a table and/or index really are. The key indicators you want to examine are "Logical Scan Fragmentation" and "Avg. Page Density." DBCC SHOWCONTIG shows three types of fragmentation: extent scan fragmentation, logical scan fragmentation, and scan density. Use DBCC INDEXDEFRAG to fix logical scan fragmentation; rebuild indexes to defrag the table and/or index completely.

Listing 17-1 shows some sample DBCC SHOWCONTIG output from the Northwind Customers table:

Listing 17-1 DBCC SHOWCONTIG.

 DBCC SHOWCONTIG (Customers)

(Results)

 DBCC SHOWCONTIG scanning 'Customers' table... Table: 'Customers' (2073058421); index ID: 1, database ID: 6 TABLE level scan performed. - Pages Scanned................................: 5 - Extents Scanned..............................: 3 - Extent Switches..............................: 4 - Avg. Pages per Extent........................: 1.7 - Scan Density [Best Count:Actual Count].......: 20.00% [1:5]  - Logical Scan Fragmentation   ..................   : 40   .   00%  - Extent Scan Fragmentation ...................: 66.67% - Avg. Bytes Free per Page.....................: 3095.2  - Avg. Page Density (full).....................:   61.76%

As you can see, the Customer table is a bit fragmented . Logical Scan Fragmentation is sitting at 40% and Avg. Page Density is at 61.76%. In other words, the pages in the table are, on average, approximately 40% empty. Let's defrag the table's clustered index and see if things improve any (Listing 17-2):

Listing 17-2 The Customer table after defragmentation.

 DBCC INDEXDEFRAG(Northwind,Customers,1)

(Results)

 Pages Scanned Pages Moved Pages Removed ------------- ----------- ------------- 1             0           1

(Results)

 DBCC SHOWCONTIG (Customers)

(Results)

 DBCC SHOWCONTIG scanning 'Customers' table... Table: 'Customers' (2073058421); index ID: 1, database ID: 6 TABLE level scan performed. - Pages Scanned................................: 4 - Extents Scanned..............................: 3 - Extent Switches..............................: 2 - Avg. Pages per Extent........................: 1.3  -   Scan Density [Best Count:Actual Count]   .......   : 33   .   33% [1:3  ]  - Logical Scan Fragmentation   ..................   : 25   .   00%  - Extent Scan Fragmentation ...................: 66.67% - Avg. Bytes Free per Page.....................: 1845.0  - Avg. Page Density (full).....................:   77.21%

As you can see, DBCC INDEXDEFRAG helped considerably. Logical Scan Fragmentation has dropped to 25% and Avg. Page Density is now at a little more than 77%an improvement of approximately 15%.

By default, DBCC SHOWCONTIG reports leaf-level info only. To scan the other levels of the table/index, specify the ALL_LEVELS option (Listing 17-3):

Listing 17-3 DBCC SHOWCONTIG can show fragmentation at all levels.

 DBCC SHOWCONTIG (Customers) WITH TABLERESULTS, ALL_LEVELS

(Results abridged)

 ObjectName IndexName    AveragePageDensity  ScanDensity       LogicalFragmenta ---------- ------------ ------------------- ------------------ ------------------ Customers  PK_Customers 77.205337524414063  33.333333333333329 25.0 Customers  PK_Customers 0.95132195949554443 0.0                0.0

Table 17-1 lists the key data elements reported by DBCC SHOWCONTIG and what they mean:

I use the Logical Scan Fragmentation and Avg. Page Density fields to determine overall table/index fragmentation. You should see them change in tandem as fragmentation increases or decreases over time.

Table 17-1. DBCC SHOWCONTIG FIELDS

SHOWCONTIG field	Meaning
Avg. Bytes Free per Page	Average number of bytes free on each page
Pages Scanned	Number of pages accessed
Extents Scanned	Number of extents accessed
Out of order pages (not displayed, but used to compute Logical Scan Fragmentation)	Number of times a page had a lower page number than the previous page in the scan
Extent Switches	Number of times a page in the scan was on a different extent than the previous page in the scan

Defragmenting

As you just saw, DBCC INDEXDEFRAG is a handy way to defragment an index. It's an online operation, so the index is still usable while it works. That said, it only reorganizes the index at the leaf level, performing a kind of bubble sort on the leaf-level pages. To fully defrag an index, you must rebuild it. You have several ways of doing this. First, you could simply drop and recreate it using DROP/CREATE INDEX. The drawback to this, though, is that you have to take the index offline while you rebuild it, and you aren't allowed to drop indexes that support constraints. You could use DBCC DBREINDEX or the DROP_EXISTING clause of CREATE INDEX, but, again, the index is unavailable until it's rebuilt. The one upside is that the index can be created in parallel if you're running on the Enterprise Edition of SQL Server. Because SQL Server's parallel index creation scales almost linearly on multiple processors, the length of time that an index is offline while being recreated can be decreased significantly by simply adding more processors.

Generally speaking, DBCC INDEXDEFRAG is the best tool for the job unless you find widespread fragmentation in the nonleaf levels of the index and you feel this fragmentation is unacceptably affecting query performance. As I mentioned, you can check the fragmentation of the other levels of an index by passing the ALL_LEVELS option to DBCC SHOWCONTIG.

In addition to defragmenting the leaf-level pages, DBCC INDEXDEFRAG also features a compaction phase in which it compacts the index's pages using the original fillfactor as its target. It attempts to leave enough space for at least one row on each page when it finishes. If it can't obtain a lock on a particular page during compaction, it skips the page. It removes any pages that end up completely empty as a result of the compaction.

Indexes on Views and Computed Columns

Creating an index on a view or a computed column persists data that would otherwise exist only in a logical sense. Normally, the data returned by a view exists only in the tables the view queries. When you query the view, your query is combined with the one comprising the view and the data is retrieved from the underlying objects. The same is true for computed columns. Normally, the data returned by a computed column does not actually exist independently of the columns or expressions it references. Every time you request it from its host table, the expression that comprises it is reevaluated and its data is generated "on the fly."

When you begin building indexes on a view, you must start with a clustered index. This is where the real data persistence happens. Just as with tables, a clustered index created on a view actually stores the data itself in its leaf-level nodes. Once the clustered index exists, you're free to created nonclustered indexes on the view as well.

Table 17-2. Indexed View/Computed Column Required Settings

Setting	Required value
ARITHABORT	ON
CONCAT_NULL_YIELDS_NULL	ON
QUOTED_IDENTIFIER	ON
ANSI_NULLS	ON
ANSI_PADDING	ON
ANSI_WARNING	ON
NUMERIC_ROUNDABORT	OFF

This differs from computed columns in tables. With computed columns, you're not required first to create a clustered index to build nonclustered indexes. Because the column will serve merely as an index key value, a nonclustered index works just fine.

Prerequisites

SQL Server requires that seven specific SET options be specified correctly to create an index on a view or a computed column. Table 17-2 lists the settings and their required values. As you can see from the table, all settings except NUMERIC_ROUNDABORT must be set to ON.

Only deterministic expressions can be used with indexed views and indexes on computed columns. A deterministic expression is one that when supplied a given input, always returns the same output. SELECT SUBSTRING('He who loves money more than truth will end up poor',23,7) is a deterministic expression; GETDATE() isn't.

You can check to see whether a view or column can be indexed using Transact-SQL's OBJECTPROPERTY() and COLUMNPROPERTY() functions (Listing 17-4):

Listing 17-4 Not all views can be indexed.

 SELECT OBJECTPROPERTY (OBJECT_ID('Invoices'), 'IsIndexable') SELECT COLUMNPROPERTY (OBJECT_ID('syscomments'), 'text' , 'IsIndexable') SELECT COLUMNPROPERTY (OBJECT_ID('syscomments'), 'text' , 'IsDeterministic')

(Results)

 ----------- 0 ----------- 0 ----------- 0

One final prerequisite for views is that a view can only be indexed if it was created with the SCHEMABINDING option. Creating a view with SCHEMABINDING causes SQL Server to prevent the objects it references from being dropped unless the view is first dropped or changed so that the SCHEMABINDING option is removed. Also, ALTER TABLE statements on tables referenced by the view will fail if they affect the view definition.

In the previous example, the Invoice view cannot be indexed because it was not created with SCHEMABINDING. Here's a version of it that was, along with a subsequent check of IsIndexable (Listing 17-5):

Listing 17-5 The Invoices2 view can now be indexed.

 CREATE VIEW Invoices2 WITH SCHEMABINDING AS SELECT Orders.ShipName, Orders.ShipAddress, Orders.ShipCity,        Orders.ShipRegion,        Orders.ShipPostalCode, Orders.ShipCountry, Orders.CustomerID,        Customers.CompanyName AS CustomerName, Customers.Address,        Customers.City,        Customers.Region, Customers.PostalCode, Customers.Country,        Orders.OrderID, Orders.OrderDate, Orders.RequiredDate,        Orders.ShippedDate, Shippers.CompanyName As ShipperName,        [Order Details].ProductID, Products.ProductName,        [Order Details].UnitPrice, [Order Details].Quantity,        [Order Details].Discount,        Orders.Freight        FROM   dbo.Shippers INNER JOIN            (dbo.Products INNER JOIN               (                 (dbo.Employees INNER JOIN                   (dbo.Customers INNER JOIN dbo.Orders                     ON Customers.CustomerID = Orders.CustomerID)                 ON Employees.EmployeeID = Orders.EmployeeID)               INNER JOIN dbo.[Order Details]                 ON Orders.OrderID = [Order Details].OrderID)            ON Products.ProductID = [Order Details].ProductID)        ON Shippers.ShipperID = Orders.ShipVia GO SELECT OBJECTPROPERTY (OBJECT_ID('Invoices2'), 'IsIndexable')

(Results)

 ----------- 1

Note that all the object references now use two-part names (they don't in the original Invoices view). Creating a view with SCHEMABINDING requires all object references to use two-part names .

Once a view has been indexed, the optimizer can make use of the index when the view is queried. In fact, on SQL Server EE, the optimizer will even use the index to service a query on the view's underlying tables if it thinks this would yield the least cost in terms of execution time.

Normally, indexed views are not used at all by the optimizer unless you're running on EE. For example, consider this index and query:

 CREATE UNIQUE CLUSTERED INDEX inv ON invoices2 (orderid, productid) GO SELECT * FROM invoices2 WHERE orderid=10844 AND productid=22

(Results abridged)

 ShipName          ShipAddress  ShipCity ShipRegion ShipPostalCode ----------------- ------------ -------- ---------- -------------- Piccolo und mehr  Geislweg 14  Salzburg NULL       5020

Here's an excerpt from its query plan (Listing 17-6):

Listing 17-6 View indexes aren't used by default on non-EE versions of SQL Server.

 StmtText --------------------------------------------------------------------------- SELECT * FROM [invoices2] WHERE [orderid]=@1 AND [productid]=@2   --Nested Loops(Inner Join)        --Nested Loops(Inner Join)            --Nested Loops(Inner Join, OUTER REFERENCES:([Orders].[ShipVia]))                --Nested Loops(Inner Join, OUTER REFERENCES:([Orders].[Employ                    --Nested Loops(Inner Join, OUTER REFERENCES:([Orders].[C                        --Clustered Index Seek(OBJECT:([Northwind].[dbo].[O                        --Clustered Index Seek(OBJECT:([Northwind].[dbo].[C                    --Clustered Index Seek(OBJECT:([Northwind].[dbo].[Employ                --Clustered Index Seek(OBJECT:([Northwind].[dbo].[Shippers].[            --Clustered Index Seek(OBJECT:([Northwind].[dbo].[Order Details].[        --Clustered Index Seek(OBJECT:([Northwind].[dbo].[Products].[PK_Product

Although the plan text is clipped on the right, you can tell that the view index obviously isn't being used, even though it contains both the columns on which the query filters. On versions of SQL Server other than EE, this is completely expected. Out of the box, only SQL Server EE will consider view indexes when formulating an execution plan. There is, however, a workaround. You can use the NOEXPAND query hint on non-EE versions of SQL Server to force the consideration of a view's index. Here's the query again (Listing 17-7), this time with the NOEXPAND keyword and the resultant query plan:

Listing 17-7 NOEXPAND forces the use of a view's index.

 SELECT * FROM invoices2 (NOEXPAND) WHERE orderid=10844 AND productid=22 StmtText --------------------------------------------------------------------------- SELECT * FROM invoices2 (NOEXPAND) WHERE orderid=10844 AND productid=22 --Clustered Index Seek(OBJECT:([Northwind].[dbo].[Invoices2].[inv]), SEEK:([

Notice that the index is now used. In conjuction with the INDEX hint, NOEXPAND can force the optimizer to use a view's index, even if doing so yields a suboptimal plan, so regard it with the same skepticism that you do other query hints. It's best to let the optimizer do its job and override it only when you have no other choice.

Locking and Indexes

One telltale sign that a table lacks a clustered index is when you notice that it has RID locks taken out on it. SQL Server will never take out RID locks on a table with a clustered index; it will always take out key locks instead.

Generally speaking, you should let SQL Server control locking of all types, including locking with indexes. Normally, it makes good decisions and will do the best job of managing its own resources.

You can use the sp_indexoption system procedure to control manually the types of locks that are allowed on an indexed table. You can use it to disable row and/or page locks, which can come in handy when SQL Server's locking facilities refuse to work correctly, particularly with regard to lock escalation.

Note that sp_indexoption only applies to indexes, so you can't control the locking with the pages in a heap. That said, when a table has a clustered index, it is affected by the settings specified via sp_indexoption.

for RuBoard