SQL Server 2005 Index Internals


It doesn't generally help database developers to have an in-depth understanding of the query engine or the storage engine without also having an in-depth understanding of how data in tables is indexed and how those index structures affect query performance.

There are two types of indexes in SQL Server. In NonClustered indexes, the data in the index is a pointer to the actual data, as shown in Figure B-14. In Clustered indexes, the data stored in the table column is physically stored in the index page, as shown in Figure B-15.

image from book
Figure B-14: The SQL Server 2005 NonClustered Index Structure

image from book
Figure B-15: The SQL Server 2005 Clustered Index Structure

As discussed earlier, the most basic unit of storage in SQL Server is a data page. Data pages are organized into extents, and extents are referenced by tables. In SQL Server 2005, tables are organized into one or more partitions, and each partition is stored in either a heap, which is simply a non-grouped data page, or clustered index structure. (See Chapter 7, "Optimizing SQL Server 2005 Performance," Lesson 2, "Optimizing Index Strategies," for more information on indexes.) The pages within the heap or clustered index structure are organized into allocation units, depending on the type of data contained within the row of data. (For example, TEXT and IMAGE data are stored in different allocation units than VARCHAR data.)

SQL Server 2005 Index Organization

SQL Server 2005 supports a single Clustered Index per table and up to 249 NonClustered indexes per table. Indexes in SQL Server 2005 utilize a structure known as a Balanced Tree (B-Tree). A B-Tree structure is a structure where navigating from the root of the structure to any point should take exactly the same number of intermediate steps no matter what end point is chosen. Each page in an index B-tree is called an index node. The top node of the B-tree is called the root node. The bottom level of nodes in the index is called the leaf nodes. Any index levels between the root and the leaf nodes are collectively known as intermediate levels. In a clustered index, the leaf nodes contain the data pages of the underlying table. The root and intermediate level nodes contain index pages holding index rows. Each index row contains a key value and a pointer to either an intermediate-level page in the B-tree, or a data row in the leaf level of the index. The pages in each level of the index are linked in a doubly linked list. Data is either organized into a Clustered Index or a Heap.

Heap Structures

A heap is a table without a clustered index. Heaps have one row in sys.partitions, with index_id = 0 for each partition used by the heap. By default, a heap has a single partition. When a heap has multiple partitions, each partition has a heap structure that contains the data for that specific partition.

Depending on the data types in the heap, each heap structure will have one or more allocation units to store and manage the data for a specific partition. At a minimum, each heap will have one IN_ROW_DATA allocation unit per partition. The heap will also have one LOB_DATA allocation unit per partition, if it contains large object (LOB) columns. It will also have one ROW_OVERFLOW_DATA allocation unit per partition, if it contains variable-length columns that exceed the 8,060-byte row size limit.

The column first_iam_page in the sys.system_internals_allocation_units system view points to the first Index Allocation Map (IAM) page in the chain of IAM pages that manage the space allocated to the heap in a specific partition. SQL Server 2005 uses the IAM pages to move through the heap. The data pages and the rows within them are not in any specific order and are not linked. The only logical connection between data pages is the information recorded in the IAM pages.

Table scans or serial reads of a heap can be performed by scanning the IAM pages to find the extents that are holding pages for the heap. Because the IAM represents extents in the same order that they exist in the data files, this means that serial heap scans progress sequentially through each file. Using the IAM pages to set the scan sequence also means that rows from the heap are not typically returned in the order in which they were inserted.

Clustered Index Structures

Clustered indexes have one row in sys.partitions, with index_id = 1 for each partition used by the index. By default, a clustered index has a single partition. When a clustered index has multiple partitions, each partition has a B-tree structure that contains the data for that specific partition. For example, if a clustered index has four partitions, there are four B-tree structures, one in each partition.

Depending on the data types in the clustered index, each clustered index structure will have one or more allocation units in which to store and manage the data for a specific partition. At a minimum, each clustered index will have one IN_ROW_DATA allocation unit per partition. The clustered index will also have one LOB_DATA allocation unit per partition if it contains large object (LOB) columns. It will also have one ROW_OVERFLOW_DATA allocation unit per partition if it contains variable-length columns that exceed the 8,060-byte row size limit.

The pages in the data chain and the rows in them are ordered on the value of the clustered index key. All inserts are made at the point where the key value in the inserted row fits in the ordering sequence among existing rows.

For a clustered index, the root_page column in sys.system_internals_allocation_units points to the top of the clustered index for a specific partition. SQL Server moves down the index to find the row corresponding to a clustered index key. To find a range of keys, SQL Server moves through the index to find the starting key value in the range and then scans through the data pages using the previous or next pointers. To find the first page in the chain of data pages, SQL Server follows the leftmost pointers from the root node of the index.

Non-Clustered Index Structures

Non-clustered indexes are very similar in structure to clustered indexes with the following key differences:

  • The data rows of the underlying table are not sorted and stored in order based on their non-clustered keys.

  • The leaf layer of a non-clustered index is made up of index pages instead of data pages.

non-clustered indexes can be defined on a table or view with a clustered index or a heap. Each index row in the non-clustered index contains the non-clustered key value and a row locator. This locator points to the data row in the clustered index or heap having the key value.

The row locators in non-clustered index rows are either a pointer to a row or are a clustered index key for a row, as explained below:

  • If the table is a heap, which means it does not have a clustered index, the row locator is a pointer to the row. The pointer is built from the file identifier (ID), page number, and number of the row on the page. The whole pointer is known as a Row ID (RID).

  • If the table has a clustered index, or the index is on an indexed view, the row locator is the clustered index key for the row. If the clustered index is not a unique index, SQL Server 2005 makes any duplicate keys unique by adding an internally generated value called a uniqueifier. This four-byte value is not visible to users. It is only added when required to make the clustered key unique for use in non-clustered indexes. SQL Server retrieves the data row by searching the clustered index using the clustered index key stored in the leaf row of the non-clustered index.

Non-clustered indexes have one row in sys.partitions with index_id >1 for each partition used by the index. By default, a non-clustered index has a single partition. When a non-clustered index has multiple partitions, each partition has a B-tree structure that contains the index rows for that specific partition. For example, if a non-clustered index has four partitions, there are four B-tree structures, with one in each partition.

Depending on the data types in the non-clustered index, each non-clustered index structure will have one or more allocation units in which to store and manage the data for a specific partition. At a minimum, each non-clustered index will have one IN_ROW_DATA allocation unit per partition that stores the index B-tree pages. The NonClustered index will also have one LOB_DATA allocation unit per partition if it contains large object (LOB) columns. Additionally, it will have one ROW_OVERFLOW_DATA allocation unit per partition if it contains variable-length columns that exceed the 8,060-byte row size limit. For more information about allocation units, see Table and Index Organization at http://msdn2.microsoft.com/en-us/library/ms189051.aspx. The page collections for the B-tree are anchored by root_page pointers in the sys.system_internals_allocation_units system view.

Included Columns Index

In SQL Server 2005, you can extend the functionality of non-clustered indexes by adding non-key columns to the leaf level of the non-clustered index. By including non-key columns, you can create non-clustered indexes that cover more queries. This is because the non-key columns have the following benefits:

  • They can be data types not permitted as index key columns.

  • They are not considered by the database engine when calculating the number of index key columns or index key size.

An index with included non-key columns can significantly improve query performance when all columns in the query are included in the index either as key or non-key columns. Performance gains are achieved because the query optimizer can locate all the column values within the index; table or clustered index data is not accessed, resulting in fewer disk I/O operations.

Understanding the physical structure of indexes in SQL Server 2005 helps database developers to optimize both the creation and use of proper indexes specific to their application.

Optimizing Index Structures

One of the biggest challenges database developers face is how to optimize their indexes to obtain the most efficient query response. When developing indexes for database applications, the following guidelines should be considered:

  • Always look at the query plan first. It will show you the optimal current execution plan from the query engine's point of view. Find the most expensive part of the execution plan and start optimizing from there. However, even before that, make sure that the statistics on all tables in your query are up-to-date by running the update statistics command on all tables in your query.

  • If you see table scan, optimize. Table scan is the slowest possible way of execution. Table scan means not only that no index is used, but that there is no clustered index for this table at all. Even if you can only replace table scan with clustered index scan, it is still worth it.

  • If you see clustered index scan, find out whether it can be replaced with index seek. For that, find the conditions applied to this table. Usually, conditions exist for two or three fields of the table. Find out the most selective condition (that is, the condition that would produce the smallest number of records if applied alone), and see whether an index on this field exists. Any index that lists this field first will qualify. If there is no such index, create it and see whether the query engine picks it up.

  • If the query engine is not picking up the existing index (that is, if it is still doing a clustered index scan), check the output list. It is possible that seek on your index is faster than clustered index scan, but involves bookmark lookup that makes the combined cost greater than use of a clustered index. Clustered index operations (scan or seek) never need bookmark lookup, because a clustered index already contains all the data. If the output list is not big, add those fields to the index and see whether the query engine picks it up. Please remember that the combined size is more important than the number of fields. Adding three integer fields to the index is less expensive than adding one varchar field with an average data length of 20.

  • If you see bookmark lookup, it means that your index is not covering. Try to make it covering if it makes sense. (See the preceding bullet.) The execution plan selected by the query engine might not be the best one. The query engine makes certain assumptions about disk subsystem and CPU cost versus IO cost. These assumptions can sometimes be incorrect. If you don't believe that the query engine's selection is the best one, run a query in the loop for 10 to 15 minutes with automatic selection, change the query to use your index (you will have to use index hint to force it), and then run it for 10 to 15 minutes again. Compare the results to see which one works better.

  • Avoid any calculations or operations on the columns where possible. Some operations will prevent the use of the index on this field even if it exists-for example, using the LTRIM or RTRIM functions on string data will seriously affect performance. For example, instead of using the condition cast (DateField as varchar(20)) = @dateString, try to convert @dateString to an expression of datetime type first, and then compare it to DateField. When it is not possible to avoid functions or calculations on the column, use an index built on that expression. This can be done in two ways:

    • Create a calculated field based on your expression.

    • Create a view and build an index on it.

Indexed views are a good way to further increase the speed of the query if you are not satisfied with the results. Indexed view is a clustered index built over the view's select list. You can also define additional indexes for the indexed view, just as you can for any regular table. Indexed views take disk space and have some maintenance overhead (every time underlying tables change, the indexed view also has to change), but they usually provide a good boost in performance, even after all other optimization techniques are exhausted.

Full-Text Indexes

A full-text index is another type of index used in SQL Server 2005. Full-text indexes differ from "normal" SQL Server indexes because they are stored outside the confines of the database storage engine. Full-text search facilitates fast and flexible indexing for keyword-based query of text data stored in a SQL Server database. Unlike the T-SQL LIKE predicate, which only works on character patterns, full-text queries perform a linguistic search against this data, operating on words and phrases based on rules of a particular language.

The performance benefit of using full-text search can be best realized when querying against a large amount of unstructured text data. A T-SQL LIKE query against millions of rows of text data can take minutes to return; whereas, a full-text query can take only seconds or less against the same data, depending on the number of rows that are returned.

Full-text indexes might be built not just on columns that contain text data, but also against formatted binary data, such as Microsoft Word documents, stored in a BLOB-type column; in these cases, it is not possible to use the LIKE predicate for keyword queries.

With the growing popularity of storing and managing textual data in a database, full-text indexes have become more common. Common uses of full-text search include Web-based applications, document-management systems, and other applications that need to provide text-search capabilities over data stored in a SQL Server database.

The full-text engine for SQL Server is what enables SQL Server 2005 to utilize full-text indexes. This engine is in turn built on Microsoft Search technology and is tightly integrated into the SQL Server 2005 Database Engine. For more information on the full-text search engine or full-text indexes, see the MSDN article "SQL Server 2005 Full Text Search: Internals and Enhancements" at http://msdn2.microsoft.com/en-us/library/ms345119.aspx.

XML Indexes

XML index is another new index type that developers need to be aware of in SQL Server 2005. SQL Server 2005 introduced the XML data type that can be used to store structured and unstructured XML data within a column in a SQL Server database. XML documents can be very large. (SQL Server can store up to 2GB in an XML column.) XML data can be queried using new XQuery functionality in SQL Server 2005. To return results efficiently, SQL Server 2005 now supports a special type of index known as an XML index.

XML indexes are broken down into two types:

  • Primary XML Index. A primary XML index is a representation of the nodes in the XML data stored in the column. Primary XML indexes are physically stored as additional rows in the XML column. A primary XML index is very similar in function and structure to a Clustered index.

  • Secondary XML Index. A secondary XML index is an additional index that is either created on the PATH, PROPERTY, or VALUE nodes within the primary XML index. Secondary XML indexes are very similar in function and structure to a NonClustered index.

Primary XML Indexes

The primary XML index is a shredded and persisted representation of the XML BLOBs in the xml data type column. For each XML binary large object (BLOB) in the column, the index creates several rows of data. The number of rows in the index is approximately equal to the number of nodes in the XML binary large object. Each row stores the following node information:

  • Tag name, such as an element or attribute name.

  • Node value.

  • Node type, such as an element node, attribute node, or text node.

  • Document order information, represented by an internal node identifier.

  • Path from each node to the root of the XML tree. This column is searched for path expressions in the query.

  • Primary key of the base table. The primary key of the base table is duplicated in the primary XML index to enable efficient join operations with the base table.

This node information is used to evaluate and construct XML results for a specified query. For the purposes of optimization, the tag name and the node type information are stored as integer values.

The query processor uses the primary XML index for queries that involve xml data type methods and returns either scalar values or the XML subtrees from the primary index itself.

Secondary XML Indexes

If your queries generally specify path expressions on xml type columns, a PATH secondary index might be able to speed up the search. As described earlier, the primary index is helpful when you have queries that specify the exist() method in the WHERE clause. If you add a PATH secondary index, you might also improve the search performance.

Although a primary XML index avoids having to shred the XML binary large objects at run time, it might not provide the best performance for queries based on path expressions. Because all rows in the primary XML index corresponding to an XML BLOB are searched sequentially for large XML instances, the sequential search might be slow. In this case, having a secondary index built on the path values and node values in the primary index can significantly speed up the index search. In the PATH secondary index, the path and node values are key columns that enable more efficient seeks when searching for paths.

If queries are value based, for example, /Root/ProductDescription/@*[. = "WidgetA"] or //ProductDescription[@Name = "Widget A"], and the path is not fully specified or it includes a wildcard, you might obtain faster results by building a secondary XML index that is built on node values in the primary XML index.

The key columns of the VALUE secondary index are (node value and path) of the primary XML index. If your workload involves querying for values from XML instances without knowing the element or attribute names that contain the values, a VALUE secondary index might be useful.

Queries that retrieve one or more values from individual XML instances might benefit from a PROPERTY secondary index. This occurs when you retrieve object properties by using the value() method of the xml type and when the primary key value of the object is known. The PROPERTY secondary index is built on columns (PK, path, and node value) of the primary XML index, where PK is the primary key of the base table. For more information on XML indexes and their usage, see the MSDN Article "XML Indexes in SQL Server 2005" at http://msdn2.microsoft.com/en-us/library/ms345121.aspx.




MCITP Self-Paced Training Kit Exam 70-442  .Designing and Optimizing Data Access by Using Microsoft SQL Server 2005
MCITP Self-Paced Training Kit Exam 70-442 .Designing and Optimizing Data Access by Using Microsoft SQL Server 2005
ISBN: 073562383X
EAN: N/A
Year: 2007
Pages: 162

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net