Index Organization | Inside Microsoft SQL Server 2005: The Storage Engine (Solid Quality Learning)

Think of the indexes you might see in your everyday lifethose in books and other documents. In this era of massive amounts of online information and extremely fast online search engines, it's probably rare that you actually open a book to look for information, so you might have to stretch your imagination a bit to follow my analogy here. But assume that all you have is printed reference materials. Suppose that you're trying to create a view in SQL Server using the CREATE VIEW statement and you're using two SQL Server references to find out how to write the statement. One document is a Microsoft SQL Server 2005 Transact-SQL Language Reference Manual. The other is Inside Microsoft SQL Server 2005: T-SQL Programming. You can quickly find information in either book about views, even though the two books might be organized completely differently.

In the Transact-SQL language reference, all the commands and keywords are organized alphabetically. You know that CREATE VIEW will be near the front with all the other CREATE STATEMENTS, so you can just ignore the last 70 percent of the book. Keywords and phrases are shown at the top of each page to tell you what commands are on that page. So you can quickly flip through just a few pages and end up at a page that has CREATE TYPE as the first key phrase and CREATE XML INDEX as the last keyword, and you know that CREATE VIEW will be on this page. If CREATE VIEW is not on this page, it's not in the book (which is highly unlikely). And once you find the entry for CREATE VIEW, right after CREATE USER, you'll find complete syntax for this command.

Next you try to find CREATE VIEW in the Inside Microsoft SQL Server book. There are no helpful keywords at the top of each page, but there's an index at the back of the book, and all the index entries are organized alphabetically. So again, you can make use of the fact that CREATE VIEW is near the front of the alphabet and find it quickly. However, unlike in the reference manual, once you find the words CREATE VIEW, you won't see nice neat examples right in front of you. The index only gives you pointersit tells you what pages to look at. It might list two or three pages in the book. If you look up SELECT in the book's index, however, you might find dozens of pages listed. And if you look up sp_addumpdevice (a completely deprecated command), you won't find it at all.

These searches through the two books are analogous to using clustered and nonclustered indexes. In a clustered index, the data is actually stored in order, just as the reference manual has all the main topics in order. Once you find the data you're looking for, you're done with the search. In a nonclustered index, the index is a completely separate structure from the data itself. Once you find what you're looking for in the index, you have to follow pointers to the actual data. A nonclustered index in SQL Server is very much like the index in the back of a book. If only one page reference is listed in the index, you can quickly find the information. If a dozen pages are listed, it takes quite a bit longer to find what you need. If hundreds of pages are listed, you might think there's no point in using the index at all. Unless you can narrow your topic of interest to something more specific, the index might not be of much help.

In Chapter 2, I explained that indexes in SQL Server store their information using standard B-trees, as shown in Figure 7-1. A B-tree provides fast access to data by searching on a key value of the index. B-trees cluster records with similar keys. The B stands for balanced, and balancing the tree is a core feature of a B-tree's usefulness. The trees are managed, and branches are grafted as necessary, so navigating down the tree to find a value and locate a specific record takes only a few page accesses. Because the trees are balanced, finding any record requires about the same amount of resources, and retrieval speed is consistent because the index has the same depth throughout.

Figure 7-1. A B-tree for a SQL Server index

Note that although Figure 7-1 shows two intermediate levels, the number will actually vary. An index consists of a tree with a single root page from which the navigation begins, possible intermediate index levels, and bottom-level leaf pages. You use the index to find the correct leaf page. The number of intermediate levels in an index will vary depending on the number of rows in the table and the size of the index rows. If you create an index using a large key, fewer entries will fit on a page, so more pages (and possibly more levels) will be needed for the index. On a qualified select, update, or delete, the correct leaf page will be the lowest page of the tree in which one or more rows with the specified key or keys reside. A qualified operation is one that affects only specific rows that satisfy the conditions of a WHERE clause, as opposed to one that accesses the whole table. In any index, whether clustered or nonclustered, the leaf level contains every key value (or combination of values, for a composite index), in key sequence. The biggest difference between clustered and nonclustered indexes is what else is in the leaf level.

Clustered Indexes

The leaf level of a clustered index contains the data pages, not just the index keys. So the answer to the question "What else is in the leaf level of a clustered index besides the key value?" is "Everything else"that is, all the columns of every row are in the leaf level. Another way to say this is that the data itself is part of the clustered index. A clustered index keeps the data in a table ordered around the key. The data pages in the table are kept in a doubly linked list called a page chain. (Note that pages in a heap are not linked together.) The order of pages in the page chain, and the order of rows on the data pages, is the order of the index key or keys. Deciding which key to cluster on is an important performance consideration. When the index is traversed to the leaf level, the data itself has been retrieved, not simply pointed to.

Because the actual page chain for the data pages can be ordered in only one way, a table can have only one clustered index. The query optimizer strongly favors a clustered index in many cases because such an index allows the data to be found directly at the leaf level. Because it defines the logical order of the data, a clustered index allows especially fast access for queries that look for a range of values. The query optimizer detects that only a certain range of data pages must be scanned.

Most tables should have a clustered index. If your table will have only one index, it generally should be clustered. Many documents describing SQL Server indexes will tell you that the clustered index physically stores the data in sorted order. This can be misleading if you think of physical storage as the disk itself. If a clustered index had to keep the data on the actual disk in a particular order, it could be prohibitively expensive to make changes. If a page got too full and had to be split in two, all the data on all the succeeding pages would have to be moved down. Sorted order in a clustered index simply means that the data page chain is logically in order. If SQL Server follows the page chain, it can access each row in clustered index key order, but new pages can be added simply by adjusting the links in the page chain. I'll tell you more about page splitting and moving rows later in the chapter when I discuss data modification.

In SQL Server 2005, all clustered indexes are unique. If you build a clustered index without specifying the UNIQUE keyword, SQL Server guarantees uniqueness internally by adding a uniqueifier to the rows when necessary. This uniqueifier is a 4-byte value added as an additional clustered index key column. It is added only to the rows that have duplicates of their declared index key column(s). You'll be able to see this extra value when we look at the actual structure of index rows later in this chapter.

Nonclustered Indexes

In a nonclustered index, the leaf level does not contain all the data. In addition to the key values, each index row in the leaf level (the lowest level of the tree) contains a bookmark that tells SQL Server where to find the data row corresponding to the key in the index. A bookmark can take one of two forms. If the table has a clustered index, the bookmark is the clustered index key for the corresponding data row. If the table is a heap (in other words, it has no clustered index), the bookmark is a row identifier (RID), which is an actual row locator in the form File#:Page#:Slot#.

The presence or absence of a nonclustered index doesn't affect how the data pages are organized, so you're not restricted to having only one nonclustered index per table, as is the case with clustered indexes. Each table can include as many as 249 nonclustered indexes, but you'll usually want to have far fewer than that.

When you search for data using a nonclustered index, the index is traversed and then SQL Server retrieves the record or records pointed to by the leaf-level indexes. For example, if you're looking for a data page using a seek operation on an index with a depth of threea root page, one intermediate page, and the leaf pageall three index pages must be traversed. If the leaf level contains a clustered index key, all the levels of the clustered index must then be traversed to locate the specific row. The clustered index will probably also have three levels, but in this case remember that the leaf level is the data itself. There are two additional index levels separate from the data, typically one less than the number of levels needed for a nonclustered index. The data page still must be retrieved, but because it has been exactly identified, there's no need to scan the entire table. Still, it takes six logical I/O operations to get one data page. You can see that a nonclustered index is a win only if it's highly selective.

Figure 7-2 illustrates this process without showing you the individual levels of the B-trees. We'll see much more detailed illustrations of the content of clustered and nonclustered index pages in the section titled "The Structure of Index Pages." I want to find the first name for the employee named Anson, and I have a nonclustered index on the last name and a clustered index on the employee ID. The nonclustered index uses the clustered keys as its bookmarks. Searching the index for Anson, SQL Server finds that the associated clustered index key is 7. It then traverses the clustered index looking for the row with a key of 7, and it finds Kim as the first name in the row I'm looking for.

Figure 7-2. Traversing both the clustered and nonclustered indexes to find the first name for the employee named Anson