Creating Indexes to Provide Faster Query Execution | Microsoft SQL Server 2005: Applied Techniques Step by Step (Pro-Step by Step Developer)

Now we will have a look at how indexes work and how they improve query performance. In SQL Server, it is possible to define two different types of indexes: clustered indexes and nonclustered indexes. To understand how indexes can improve data access and which index type to use for a given situation, it is essential to understand how data and indexes are stored in data files and how SQL Server accesses data in data files.

Heap Structures

A data file in a SQL Server database is divided in 8k pages. Every page can contain data, indexes, or other types of data that SQL Server needs to maintain the data files. However, the majority of the pages are data or index pages. Pages are the units that SQL Server writes and reads from the data files. Every page contains data or index information for only one database object. So on each data page, you'll find the data of a single object, and on each index page, you'll find information for only one index. In SQL Server 2000, it was not possible for a data row to be split between pages, which meant that a data row had to fit on one page, limiting the row size to about 8060 bytes (with the exception of large object data). In SQL Server 2005, this restriction no longer exists for varying-length datatypes like nvarchar, varbinary, CLR, and so on. With varying-length datatypes, data rows can span several pages, but all the fixed-length datatypes of a row still have to fit on one page.

When you create a table without any index and fill it with data, SQL Server searches for unused pages on which to store the data. To keep track of which pages hold the data for the table, SQL Server has one or more IAM (Index Allocation Map) pages for every table. These IAM pages point to the pages where the table's data is stored. Since the data for this table is stored on the pages without any index and is only kept together through IAM pages, the table is called a heap. To access data in a heap, SQL Server has to read the IAM page of that table and then scan all pages pointed to by the IAM page. Such an operation is called a table scan. Table scans read all of the data without any order. If a query searches for one particular row, a table scan of a heap will have to read all the rows in the table just to find it, which is a very inefficient operation.

Examining Heap Structures

1.	Open SQL Server Management Studio. Open a New Query window and change the database context to AdventureWorks.
2.	In the following examples, you will create two tables called dbo.Orders and dbo.OrderDetails. Type and execute the following statements to build the tables and populate them with data. The code for this entire example is included in the sample files as Examining Heap Structures.sql. USE AdventureWorks GO CREATE TABLE dbo.Orders( SalesOrderID int NOT NULL, OrderDate datetime NOT NULL, ShipDate datetime NULL, Status tinyint NOT NULL, PurchaseOrderNumber dbo.OrderNumber NULL, CustomerID int NOT NULL, ContactID int NOT NULL, SalesPersonID int NULL ); CREATE TABLE dbo.OrderDetails( SalesOrderID int NOT NULL, SalesOrderDetailID int NOT NULL, CarrierTrackingNumber nvarchar(25), OrderQty smallint NOT NULL, ProductID int NOT NULL, UnitPrice money NOT NULL, UnitPriceDiscount money NOT NULL, LineTotal AS (isnull((UnitPrice((1.0)-UnitPriceDiscount))OrderQty,(0.0))) ); INSERT INTO dbo.Orders SELECT SalesOrderID, OrderDate, ShipDate, Status, PurchaseOrderNumber, CustomerID, ContactID, SalesPersonID FROM Sales.SalesOrderHeader; INSERT INTO dbo.OrderDetails(SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber, OrderQty, ProductID, UnitPrice, UnitPriceDiscount) SELECT SalesOrderID, SalesOrderDetailID, CarrierTrackingNumber,OrderQty, ProductID, UnitPrice, UnitPriceDiscount FROM Sales.SalesOrderDetail;
3.	Now you have built the two tables with the heap storage structure. Type the following statements to query the dbo.Orders table. Include the actual execution plan by pressing Ctrl+M prior to execution or by selecting Include Actual Execution Plan from the Query menu. Execute the query. SET STATISTICS IO ON; SELECT * FROM dbo.Orders SET STATISTICS IO OFF The SET STATISTICS IO option turns on a feature that causes SQL Server to send messages about the I/O performed while executing a statement back to the client. This is an excellent feature to use to determine the I/O cost of queries.
4.	Switch to the Messages tab. You will see a message similar to the following This output tells us that SQL Server needed to scan the data of the table one time and needed to perform178 page reads (logical reads) for this operation. The output also shows us that no physical reads were used (physical reads or read-ahead reads) to perform the operation. There were no physical reads because, in this case, the data was already in the buffer cache. If your Messages window indicates that there were physical reads with that query, execute the query again and you should see that the number of physical reads is lower than before. This is because SQL Server keeps recently accessed data pages in the buffer cache to increase performance.
5.	Switch to the Execution Plan tab. In the execution plan, shown below, you see that SQL Server used a Table Scan operation to access the data since that was its only choice.
6.	Now change the query a little bit to retrieve a specific row. SET STATISTICS IO ON; SELECT * FROM dbo.Orders WHERE SalesOrderID =46699; SET STATISTICS IO OFF;
7.	Examine the message output and the graphical execution plan. You will see that SQL Server still needed 178 page reads and used a Table Scan operation for the query. The table scan is used because SQL Server has no index and therefore needs to scan all data in the table to find the right row. You can see that SQL Server uses table scans to access tables without any index. This scan forces SQL Server to scan all data regardless of the size of the table. In a very large table, a table scan could take a long time.

Indexes on Tables

To improve data access performance, you should define indexes on columns. The columns where an index is defined are called the index key columns. An index built on a column is similar to a book index. It includes the sorted values from that column and pointers to the pages where the actual rows of data can be found.

To find the rows where the indexed column has a specific value, SQL Server has to search the index for that value and then follow the pointer to read the row. This operation is much easier and cheaper than scanning all of the data as we did when using the table scan.

Indexes in SQL Server are built in a tree structure called a balanced tree. The general structure of a balanced tree can be seen in Figure 6-1. As you can see, the bottom level is called the leaf level. You can think of the leaf level as a book index. It includes an entry for every data row, sorted by the index column. To find values quickly within the index, a tree is built upon it using the < (less than) and > (greater than) comparison operations. The number of index levels depends on the number of rows and the size of the index key. In reality, an index page would contain many more entries than you see in Figure 6-1. Since a page is 8k in size, SQL Server can point to hundreds of pages from one index page. Therefore, indexes normally don't have many levels, even when the table contains millions of rows. This fact makes it possible to find specific values very quickly.

Figure 6-1. The general structure of a B-tree.

As mentioned before, SQL Server has two types of indexes: clustered and nonclustered indexes. Both types are balanced trees, but each is built differently. Let us now have a look at what the differences are.

Clustered Indexes

Clustered indexes are a special kind of balanced tree. The difference from what you have seen before is found in the leaf level of the index. In a clustered index, the leaf level does not include the index key and a pointer; instead, the leaf level is the data itself. This difference means that the data is not stored in a heap structure anymore. It is now stored in the leaf level of the index, sorted by the index key. This design has two advantages:

SQL Server doesn't need to follow a pointer to access the data. The data is stored directly in the index.
The data is sorted by the index key, which is the main advantage. Whenever SQL Server needs the data sorted by the index key, performing a sort operation is no longer necessary because it is sorted already.

Because the data is included in a clustered index, it is only possible to define one clustered index per table. To create a clustered index, the following syntax is used:

CREATE [ UNIQUE ] CLUSTERED INDEX index_name     ON <object> ( column [ ASC | DESC ] [ ,...n ] )

As you can see, it is possible to define indexes as unique, which means that no two rows are permitted to have the same index key value.

Note

SQL Server builds a unique index when a primary or unique constraint is created on a table. When a primary key is defined it builds a clustered index by default if no clustered index exists on the table. The kind of index SQL Server should use when creating a primary or unique constraint can be defined within the CREATE or ALTER TABLE statements with the CLUSTERED or NONCLUSTERED keyword.

Creating and Using Clustered Indexes

1.	Open SQL Server Management Studio. Open a New Query window and change the database context to AdventureWorks.
2.	Type and execute the following statement to create a unique clustered index on the Orders table. The code for this example is included in the sample files as Creating And Using Clustered Indexes.sql. CREATE UNIQUE CLUSTERED INDEX CLIDX_Orders_SalesOrderID ON dbo.Orders(SalesOrderID)
3.	Now execute the same two SELECT statements as before and examine the differences. Be sure to include the actual query plan when executing. SET STATISTICS IO ON; SELECT * FROM dbo.Orders; SELECT * FROM dbo.Orders WHERE SalesOrderID =46699; SET STATISTICS IO OFF;
4.	Switch to the Execution Plan tab. What you see here is that SQL Server is no longer using table scans. Now SQL Server performs index operations, because the data is no longer stored in a heap structure. The query execution plan shows that you are using the two main index operations that exist: Index scan A scan through the leaf level of the index, reading all the data of the table. Because the first SELECT statement has no WHERE clause, SQL Server knows that it needs to retrieve all data, which is stored in the leaf level of the index. Index seek An operation in which SQL Server searches for a particular value by passing through the branches of the index, beginning at the root of the index. These two operations can also be combined to retrieve particular ranges of data. In this type of partial scan operation, SQL Server seeks to get to the beginning of the range and then scans until the end of the range.
5.	Switch to the Messages tab, shown below. What you see is that the first SELECT statement produced almost the same amount of page reads as the table scan did when you had the heap structure. This is not surprising because the SELECT statement requests all data and therefore SQL Server has to retrieve all data. But the second query only produced two page reads, which is a major improvement over the 178 page reads you saw before. SQL Server only needed to seek through the index, which requires much less I/O than searching through every page of data.
6.	Enter the following SELECT statements, which retrieve the data in a sorted form, and press Ctrl+L to get the estimated execution plan. SELECT * FROM dbo.Orders ORDER BY SalesOrderID; SELECT * FROM dbo.Orders ORDER BY OrderDate;

You can see here that the first statement performs only a clustered index scan and doesn't sort the data. This is because the data is already sorted by SalesOrderID because this column is the clustered index key. Therefore, SQL Server only needs to scan the data on the leaf level and return the result to get the rows in the right order.

For the second query, the data has to be sorted after it is retrieved. Therefore, you see a Sort operation after the clustered index scan where the data is sorted by the OrderDate column. Since sorting is a very expensive operation, the second query produces 93% of the total query costs of both queries. Thus, it is a good practice to define clustered indexes on columns that are frequently used as sort columns, or as the grouping criteria of an aggregate, because aggregating data requires SQL Server to sort the data according to the grouping criteria first.

Now you will build a composite index on the OrderDetails table. A composite index is an index that is defined for more than one column. The index keys will be sorted by the first index key column, then by the second, and so on. Such an index is useful when two or more columns are used together as search arguments in a query or when the uniqueness of a row has to be determined throughout more than one column.

Building a Composite Clustered Index

In SQL Server Management Studio, enter and execute the following statement to create a composite clustered index on the OrderDetails table.

CREATE UNIQUE CLUSTERED INDEX CLIDX_OrderDetails     ON dbo.OrderDetails(SalesOrderID,SalesOrderDetailID)

Now enter the two following SELECT statements. The first searches for a specific SalesOrderID and the second for a SalesOrderDetailID. Both of these columns are index columns of our CLIDX_OrderDetails index. Press Ctrl+L to show the estimated execution plan.

SELECT * FROM dbo.OrderDetails WHERE SalesOrderID = 46999 SELECT * FROM dbo.OrderDetails WHERE SalesOrderDetailID = 14147

You can see that in the first query, which searches for a value from the first column of the composite index, SQL Server uses an index seek to find the row. In the second query it uses an index scan, which is a much more expensive operation. This index scan is used because it is not possible to seek only for a value from the second column of a composite index since the index is initially sorted by the first column. Thus, it is important to consider the order of the index columns in your composite indexes. Remember, composite indexes should only be used when the additional columns are always searched in combination with the first column or when uniqueness must be enforced.

Nonclustered Indexes

In contrast with clustered indexes, nonclustered indexes do not include all of the data rows in the leaf level of the index. Instead, all of the key columns and a pointer to the row in the table is stored on the leaf level. How the pointer is written and used depends on whether the base table is a heap or has a clustered index:

Heap If the table has no clustered index, SQL Server stores a pointer to the physical row (file id, page id, and row id within the page) at the leaf level of the nonclustered index. To find a specific row in this case, SQL Server seeks through the index and follows the pointer to retrieve the row.
Clustered index When a clustered index exists, SQL Server stores the clustering index keys of the rows as pointers in the leaf level of the nonclustered index. If SQL Server retrieves a row by means of a nonclustered index, it seeks through the nonclustered index, retrieves the appropriate clustering key, and then seeks through the clustered index to retrieve the row.

Since nonclustered indexes don't include whole data rows, it is possible to create up to 249 nonclustered indexes per table. The syntax to create them is very much the same as for clustered indexes:

CREATE [ UNIQUE ] NONCLUSTERED INDEX index_name     ON <object> ( column [ ASC | DESC ] [ ,...n ] )

Creating and Using Nonclustered Indexes

Open SQL Server Management Studio. Open a New Query window and change the database context to AdventureWorks.

Enter the following SELECT statement and press Ctrl+L to show the estimated execution plan. The code for this example is included in the sample files as Creating And Using Nonclustered Indexes.sql.

SELECT DISTINCT SalesOrderID, CarrierTrackingNumber     FROM dbo.OrderDetails     WHERE ProductID = 776

SQL Server performs a clustered index scan operation because it has no index on ProductID. To speed up this query, SQL Server would need an index on ProductID. Because a clustered index is already defined on the OrderDetails table, you must use a nonclustered index.

Note

The Sort operator is used in this query plan to make the result distinct.

Enter and execute the following statement to create a nonclustered index on the ProductID table of the OrderDetail table.

CREATE INDEX NCLIX_OrderDetails_ProductID     ON dbo.OrderDetails(ProductID)

Use the same SELECT statement as before and press Ctrl+L to show the estimated execution plan.

SELECT DISTINCT SalesOrderID, CarrierTrackingNumber     FROM dbo.OrderDetails     WHERE ProductID = 776

When you let the mouse hover over the upper Index Seek operator, you can see that SQL Server performs an index seek on NCLIX_OrderDetails_ProductID to retrieve the pointers to the actual rows. Because a clustered index exists on that table, it receives the list of clustering keys as pointers. This list is the input of the Nested Loops operator, which is a type of Join operator (joins will be discussed later in the chapter). The Nested Loops operator uses an index seek on the clustered index to retrieve the actual rows of data, which then go to the distinct Sort operator to make the result distinct. This is how SQL Server retrieves rows with the help of a nonclustered index when a clustered index exists.

Now let's look at how SQL Server accesses data when a nonclustered index exists on a table with no clustered index. Enter and execute the following DROP INDEX statement to drop the clustered index from the OrderDetails table.

DROP INDEX OrderDetails.CLIDX_OrderDetails

Enter the same SELECT statement as before and press Ctrl+L to show the estimated execution plan.

SELECT DISTINCT SalesOrderID, CarrierTrackingNumber     FROM dbo.OrderDetails     WHERE ProductID = 776

You can see that SQL Server now uses a RID Lookup operator because the pointers SQL Server receives from the index seek are pointers to the physical rows of data, not the clustering key. The RID Lookup operator is the operator that SQL Server uses to retrieve rows directly from a page.

Enter and execute the following statement to create the clustered index again.

CREATE UNIQUE CLUSTERED INDEX CLIDX_OrderDetails     ON dbo.OrderDetails(SalesOrderID,SalesOrderDetailID)

Using Covered Indexes

It is not always necessary for SQL Server to retrieve the whole row as a second step when using nonclustered indexes. This situation arises when the nonclustered index includes all the data of the table SQL Server needs to perform the operation. When this happens, we call the index a covered index because that index covers the entire query. Covered indexes can speed up query performance tremendously, as you can see from the two query plans in the previous example. In those queries, operators that retrieve the actual data rows cost 97% of the whole query. In other words, the query would be about 32 times faster without that operation. Let's have a look how covered indexes work.

Using a Covered Index

1.	Open SQL Server Management Studio, open a New Query window, and change the database context to AdventureWorks.
2.	The idea of a covered index is that it includes all data necessary for the query. If you have a look at the first query below, which you used in the previous example, you can see that SQL Server needs the columns SalesOrderID, CarrierTrackingNumber, and ProductID. The nonclustered index NCLIX_OrderDetails_ProductID you created before includes the ProductID because it is built on that column and also SalesOrderID because this column is the key of the clustered index. Because of this, SalesOrderID is the pointer SQL Server uses in the nonclustered index. Therefore, SQL Server needs to retrieve the data row by seeking the clustered index only to get the CarrierTrackingNumber. In the second query, the CarrierTrackingNumber is not in the SELECT list. Type and execute the statement with the actual execution plan included to see the difference. The code for this example is included in the samples files as Using Covered Indexes.sql. SET STATISTICS IO ON --not covered SELECT DISTINCT SalesOrderID, CarrierTrackingNumber FROM dbo.OrderDetails WHERE ProductID = 776 --covered SELECT DISTINCT SalesOrderID FROM dbo.OrderDetails WHERE ProductID = 776 SET STATISTICS IO OFF You can see in the graphic shown below that SQL Server needs no access to the clustered index for the second query because the index covers the query when CarrierTrackingNumber is not selected. Because accessing the clustered index for every row is very costly, the second query uses only 1% of the total batch cost. Looking at the Messages tab, you will see that for the covered query, SQL Server needs only 2 page reads as opposed to 709 for the first query. You have seen now how big the benefit from a covered index can be. Of course, it is not possible to eliminate columns from queries if they are needed in the result. But as a general rule, retrieve only the columns you truly need in order to have a higher likelihood of using a covered index. It is not a good practice to use * in the SELECT statement just because it is easier to write.
3.	Suppose the CarrierTrackingNumber is needed in the query, but for performance reasons a covered index should be used. It is possible to include the column in the nonclustered index in SQL Server 2005. Included columns are stored with the index keys on the leaf level of the nonclustered index, making it unnecessary to retrieve from the clustered index. To include the CarrierTrackingNumber in your nonclustered index, enter and execute the following statements to DROP the index and CREATE the index again with the included column. DROP INDEX dbo.OrderDetails.NCLIX_OrderDetails_ProductID CREATE INDEX NCLIX_OrderDetails_ProductID ON dbo.OrderDetails(ProductID) INCLUDE (CarrierTrackingNumber)
4.	Execute the former uncovered query to see whether it is now covered. SET STATISTICS IO ON SELECT DISTINCT SalesOrderID, CarrierTrackingNumber FROM dbo.OrderDetails WHERE ProductID = 776 SET STATISTICS IO OFF

Looking at the execution plan, you can see that no clustered index access is needed anymore. On the Messages tab you can see that the query needs only five page reads now as opposed to the 709 it needed before.

Note

Included columns produce overhead for SQL Server when data changes because SQL Server has to change it in every index and because more space in the data files is needed. Therefore, including columns in nonclustered indexes is a good way to improve query performance, but it is not a good practice to create indexes with included columns for all queries in an application. This feature should be used selectively to speed up problematic queries.

Indexes on Computed Columns

In the dbo.OrderDetails table, you have the computed column LineTotal, which represents the LineTotal of the row and is calculated by the formula below.

(isnull(([UnitPrice]*((1.0)-[UnitPriceDiscount]))*[OrderQty],(0.0)))

Each time this column is accessed, SQL Server calculates the values from the values of the original rows referenced by the calculation. This process is not a problem as long as LineTotal is used in the SELECT clause, but it is an issue when LineTotal is used as a search predicate in the WHERE clause or in aggregation functions, like MAX or MIN. When computed columns are used for searches, SQL Server has to calculate the value for every row of the table and then search within the result for the requested rows. This is a very inefficient process because it always requires a table or full clustered index scan. For these types of queries, it is possible to build indexes on computed columns. When an index is built on a computed column, SQL Server calculates the result in advance and builds an index over them.

Note

There are some restrictions on building indexes on computed columns. The main restrictions are that the computed column has to be deterministic and precise. For more information on these restrictions, see the SQL Server Books Online topic "Creating Indexes on Computed Columns."

Creating and Using Indexes on Computed Columns

1.	Open SQL Server Management Studio. Open a New Query window and change the database context to AdventureWorks.
2.	Imagine you want to retrieve all SalesOrderIDs where LineTotal has a specific value. Enter the query below and press Ctrl+L to see the estimated query plan when no supporting index on the computed column exists. As you can see, SQL Server needs to do a clustered index scan, compute the column values, and filter for the required rows. The code for this example is included in the sample files as IndexesOnComputedColumns.sql. SELECT SalesOrderID FROM OrderDetails WHERE LineTotal = 27893.619
3.	Enter and execute the following CREATE INDEX statement to build an index on the computed column. CREATE NONCLUSTERED INDEX NCL_OrderDetail_LineTotal ON dbo.OrderDetails(LineTotal)
4.	Highlight the first query and press Ctrl+L to show the new query plan with the index on the computed column. As shown below, you see that SQL Server uses the newly built index to retrieve the data, making the query much faster than before. SELECT SalesOrderID FROM OrderDetails WHERE LineTotal = 27893.619
5.	Close SQL Server Management Studio.

Indexes on XML Columns

SQL Server 2005 has a native XML datatype. XML instances in XML datatypes are stored as binary large objects (BLOBs) and can have a size up to 2 GB per instance. XML data can be queried using XQuery, but querying XML datatypes can be very time consuming without an index. This is true especially of large XML instances because SQL Server has to shred the binary large object containing the XML at runtime to evaluate the query.

To improve query performance on XML datatypes, XML columns can be indexed. XML indexes fall in two categories: primary XML indexes and secondary XML indexes.

Creating and Using Primary XML Indexes

The first index to create on an XML column is the primary XML index. When creating that index, SQL Server shreds the XML content and creates several rows of data that include information like element and attribute names, the path to the root, node types and values, and so on. With this information, SQL Server can support XQuery requests more easily. To build a primary XML index, a primary key with a clustered index has to exist on the base table. The syntax to build a primary XML index is shown below:

CREATE PRIMARY XML INDEX index_name     ON <object> ( xml_column_name )

Creating a Primary XML Index

1.	Open SQL Server Management Studio. Open a New Query window and change the database context to AdventureWorks.
2.	Enter and execute the following statements to create a table to use in the following procedures. The code for this is included in the sample files as CreatingAndUsingPrimary-XMLIndexes.sql. CREATE TABLE dbo.Products( ProductID int NOT NULL, Name dbo.Name NOT NULL, CatalogDescription xml NULL, CONSTRAINT PK_ProductModel_ProductID PRIMARY KEY CLUSTERED ( ProductID )); INSERT INTO dbo.Products (ProductID,Name,CatalogDescription) SELECT ProductModelID, Name, CatalogDescription FROM Production.ProductModel;
3.	Use the following CREATE INDEX statement to create a primary index on the CatalogDescription Column of the Production.ProductModel table. CREATE PRIMARY XML INDEX PRXML_Products_CatalogDesc ON dbo.Products (CatalogDescription);
4.	The following statement uses an XQuery method to retrieve only XML data when a specific path exists within the XML document. Enter and execute this statement, and be sure to include the query execution plan. WITH XMLNAMESPACES ('http://schemas.microsoft.com/sqlserver/2004/07/adventure-works /ProductModelDescription' AS "PD") SELECT ProductID, CatalogDescription FROM dbo.Products WHERE CatalogDescription.exist ('/PD:ProductDescription/PD:Features') = 1
5.	In the following execution plan, you can see that SQL Server uses the index on the XML column to find the right rows and retrieves the actual data row with a Nested Loop operator. SQL Server then uses an index seek against the clustered index in the same manner that you observed with nonclustered indexes.
6.	Close SQL Server Management Studio.

Secondary XML Indexes

Although primary XML indexes improve XQuery performance because the XML data is shredded, SQL Server still needs to scan through the shredded data to find the queried data. To improve query performance even more, it is possible to create secondary XML indexes on top of primary XML indexes. There are three types of secondary XML indexes, and each type supports specific query types against XML columns, providing the functionality to create only the index types needed for a specific scenario. The three types of secondary XML indexes are:

Path Secondary XML indexes, which are useful when using the .exist methods to determine whether a specific path exists.
Value Secondary XML indexes, which are used when performing value-based queries where the full path is unknown or includes wildcards.
Property Secondary XML indexes, which are used to retrieve values when the path to the value is known.

The general syntax for creating secondary XML indexes is:

CREATE XML INDEX index_name     ON <object> ( xml_column_name )     USING XML INDEX xml_index_name     FOR { VALUE | PATH | PROPERTY }

Creating and Using Secondary XML Indexes

1.	Open SQL Server Management Studio. Open a New Query window and change the database context to AdventureWorks.
2.	Enter and execute the following statement to create secondary path, value, and property indexes on the XML column CatalogDescription. The code for this example is included in the sample files as CreatingAndUsingSecondaryXMLindexes.sql. CREATE XML INDEX IXML_Products_CatalogDesc_Path ON dbo.Products (CatalogDescription) USING XML INDEX PRXML_Products_CatalogDesc FOR PATH CREATE XML INDEX IXML_Products_CatalogDesc_Value ON dbo.Products (CatalogDescription) USING XML INDEX PRXML_Products_CatalogDesc FOR VALUE CREATE XML INDEX IXML_Products_CatalogDesc_Property ON dbo.Products (CatalogDescription) USING XML INDEX PRXML_Products_CatalogDesc FOR PROPERTY
3.	Enter the following statements, which use XQuery methods, and request the estimated execution plans by pressing Ctrl+L. Examine the different ways SQL Server uses these newly created indexes. WITH XMLNAMESPACES ('http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/ ProductModelDescription' AS "PD") SELECT * FROM dbo.Products WHERE CatalogDescription.exist ('/PD:ProductDescription/@ProductModelID[.="19"]') = 1; WITH XMLNAMESPACES ('http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/ ProductModelDescription' AS "PD") SELECT * FROM dbo.Products WHERE CatalogDescription.exist ('//PD:*/@ProductModelID[.="19"]') = 1; WITH XMLNAMESPACES ('http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/ ProductModelDescription' AS "PD") SELECT CatalogDescription.value('(/PD:ProductDescription/@ProductModelID)[1]', 'int') as PID FROM dbo.Products WHERE CatalogDescription.exist ('/PD:ProductDescription/@ProductModelID') = 1;
4.	Close SQL Server Management Studio.

Indexes on Views

A view without an index doesn't need any storage space, and when it is used in a statement, SQL Server merges the view definition with the statement, optimizes it, produces a query plan, and retrieves the data. The overhead of this process can be significant when the view processes or joins a large number of rows. In such a situation, it can be helpful to index the view if it is queried frequently.

When a view is indexed, it is processed and the result is stored in the data file just like a clustered table. SQL Server automatically maintains the index when data in the base table changes. Indexes on views can speed up data access through the view tremendously, but of course indexing does involve overhead when data in the base table changes. Therefore, consider using indexed views when the view processes many rows, as with aggregate functions, and when the data in the base table does not change frequently.

With SQL Server 2005 Enterprise, Developer, or Evaluation Edition, indexed views can speed up queries that don't reference the view directly. When the query being processed includes an aggregate, for example, and SQL Server Optimizer finds an indexed view where the aggregate is already included, it retrieves the aggregate from the index rather than recalculating it.

To create an indexed view, use the following steps:

Creating an Indexed View

1.	Create a view with SCHEMABINDING. The view has to meet several requirements. For instance, it can only reference base tables that exist in the same database. All referenced functions have to be deterministic; rowset functions, derived tables, and sub-queries are not allowed at all. The full list of requirements can be found in the SQL Server Books Online topic "Creating Indexed Views."
2.	Create a unique clustered index on the view. The leaf level of this index consists of the full result set of the view.
3.	Create nonclustered indexes on top of the clustered index as needed. Nonclustered indexes can be created as usual.

Creating and Using Indexed Views

1.	Open SQL Server Management Studio. Open a New Query window and change the database context to AdventureWorks.
2.	Enter and execute the following statement to create a view that aggregates the LineTotal grouped by the month of the order. The code for this example is included in the sample files as CreatingAndUsingIndexedViews.sql. CREATE VIEW dbo.vOrderDetails WITH SCHEMABINDING AS SELECT DATEPART(yy,Orderdate) as Year, DATEPART(mm,Orderdate) as Month, SUM(LineTotal) as OrderTotal, COUNT_BIG(*) as LineCount FROM dbo.Orders o INNER JOIN dbo.OrderDetails od ON o.SalesOrderID = od.SalesOrderID GROUP BY DATEPART(yy,Orderdate), DATEPART(mm,Orderdate)
3.	Enter and execute the following SELECT statement. On the Messages tab, you can see that SQL Server needs almost 1000 page reads to execute the statement. SET STATISTICS IO ON SELECT Year, Month, OrderTotal FROM dbo.vOrderDetails ORDER BY YEAR, MONTH SET STATISTICS IO OFF
4.	Enter and execute the following CREATE INDEX statement to create a unique clustered index on the vOrderDetails view. CREATE UNIQUE CLUSTERED INDEX CLIDX_vOrderDetails_Year_Month ON dbo.vOrderDetails(Year,Month)
5.	Execute the following SELECT statement. Notice that SQL Server now needs only two page reads because the result is already calculated and stored in the index. SET STATISTICS IO ON SELECT Year, Month, OrderTotal FROM dbo.vOrderDetails ORDER BY YEAR, MONTH SET STATISTICS IO OFF
6.	If you have SQL Server 2005 Enterprise, Developer, or Evaluation editions installed, type and execute the following SELECT statement, which does not reference the view, and press Ctrl+L to request the estimated execution plan, which is shown below. SELECT DATEPART(yy,Orderdate) as Year, SUM(LineTotal) as YearTotal FROM dbo.Orders o INNER JOIN dbo.OrderDetails od ON o.SalesOrderID = od.SalesOrderID GROUP BY DATEPART(yy,Orderdate)
7.	The preceding query plan shows that SQL Server uses the clustered index on the view to retrieve the data because it is more efficient to create the YearTotal aggregate by summing up the monthly aggregates found in the view. Thus, you can see that the presence of indexed views makes it possible to speed up a query without recoding the query itself.
8.	Close SQL Server Management Studio.

Indexes that Speed Up Join Operations

Join operators are used to join together tables or intermediate results. SQL Server uses three types of join operators.

Nested Loop Joins use one join input as the inner input table and the other join input as the outer input table. Nested loops scan once through the each inner input row and search for the corresponding row in the outer input. When an index exists on the columns of the join condition of the outer input, SQL Server can use an index seek to find the rows in the outer input. If no index exists, SQL Server has to use scan operators to find matching rows in the outer input for every row of the inner input. Nested loops are always used when the inner input has only a few rows because, in that case, it is the most efficient join operation.
Merge Joins are used when the join inputs are sorted on their join columns. In a merge join operation SQL Server scans once through the sorted inputs and merges the data together, like closing a zipper. The merge join operation is very efficient, but data has to be sorted in advance, which means that indexes have to exist on the join columns. If no index exists, SQL Server can decide to sort the input first, but this is not done very often because sorting data is generally inefficient.
Hash Joins are used on large, unsorted, non-indexed inputs. Hash joins use hashing operations on join columns to join inputs together. To calculate and store the result of a hash operation, SQL Server needs more memory and CPU processing than other join operations.

Examining Join Operations

1.	Open SQL Server Management Studio. Open a New Query window and change the database context to AdventureWorks.
2.	Enter and execute the following statement, and don't forget to include the actual execution plan. The code for this example is included in the sample files as Examining Join Operations.sql. The query plan of this statement is shown below. You can see that SQL Server uses an index seek to retrieve the rows of the inner input (data from the OrderDetails table) and then uses a Nested Loop Join operator because there is only one matching row in the outer input (data from the Orders table). The matching rows of the outer input are retrieved with an index seek as well because a matching index exists. SQL Server only needs five page reads to retrieve the data, as you can see on the Messages tab. SET STATISTICS IO ON SELECT o.SalesOrderID, o.OrderDate, od.ProductID FROM dbo.Orders o INNER JOIN dbo.OrderDetails od ON o.SalesOrderID = od.SalesOrderID WHERE o.SalesOrderID = 43659
3.	Change the query to retrieve more than one SalesOrderID. In this case, SQL Server uses a merge join because the sorted rows exist in an index and the inner input has many rows. Enter and execute the following query. You will see that SQL Server needs 19 page reads to execute this query. SELECT o.SalesOrderID, o.OrderDate, od.ProductID FROM dbo.Orders o INNER JOIN dbo.OrderDetails od ON o.SalesOrderID = od.SalesOrderID WHERE o.SalesOrderID BETWEEN 43659 AND 44000
4.	Enter and execute the following batch, which drops the supporting indexes. DROP INDEX CLIDX_Orders_SalesOrderID ON dbo.Orders DROP INDEX CLIDX_OrderDetails ON dbo.OrderDetails
5.	Re-execute the SELECT statements you used before (shown again below) and examine how the query plans and I/O have changed. SELECT o.SalesOrderID, o.OrderDate, od.ProductID FROM dbo.Orders o INNER JOIN dbo.OrderDetails od ON o.SalesOrderID = od.SalesOrderID WHERE o.SalesOrderID = 43659 SELECT o.SalesOrderID, o.OrderDate, od.ProductID FROM dbo.Orders o INNER JOIN dbo.OrderDetails od ON o.SalesOrderID = od.SalesOrderID WHERE o.SalesOrderID BETWEEN 43659 AND 44000 The preceding query plan shows that SQL Server uses a nested loop join again for the first query, because the inner input is small, and a hash join for the second query, because the input data is no longer sorted. Because no supporting indexes exist, SQL Server has to scan the entirety of the base tables in both cases, which requires about 1000 page reads for both queries.

As you have seen in this section, it is very important to have supporting indexes for joins. As a general rule, all foreign key constraints should be indexed because they are the join conditions for almost all queries.

Data Distribution and Statistics

In the last example, you saw that SQL Server chooses different Join operators based on the input size for the joins. Also, for other operations, like index seeks or scans, SQL Server needs to know how many rows are affected to determine the best operator to use. This determination is made based on statistics because SQL Server needs to make it before actually accessing the data. These statistics are created and updated automatically on a column basis by SQL Server using the steps detailed below:

1.	A query is submitted to SQL Server.
2.	SQL Server Query Optimizer starts by determining which data has to be accessed.
3.	SQL Server looks for statistics on the columns accessed. If statistics already exist and are up to date, SQL Server can proceed. When no statistics exist, SQL Server generates new statistics. When statistics exist but they are not up to date, SQL Server calculates new statistics for the data.
4.	SQL Server Optimizer proceeds with generating the query plan.

This is the default behavior, and it is the best choice for most databases. You can use the ALTER DATABASE statement to tell SQL Server that it should update statistics asynchronously, which means that it won't wait for the new statistics when generating a query plan. Of course, this means that the query plan generated may not be optimal because old statistics were used to create it. In special situations, it can be desirable to generate or update statistics manually. This can be done with the CREATE STATISTICS or UPDATE STATISTICS statements. It is also possible to turn off automatic index creation and updating with the ALTER DATABASE statement at the database level. All of these options should be used only in special situations because the default behavior works best in most cases. To learn more about these options, read the whitepaper "Statistics Used by the Query Optimizer in Microsoft SQL Server 2005" at http://www.microsoft.com/technet/prodtechnol/sql/2005/qrystats.mspx.

Viewing Data Distribution Statistics

1.	Open SQL Server Management Studio. Open a New Query window and change the database context to AdventureWorks.
2.	To get information on which statistics exist, you can query the sys.stats and sys.stats_columns views. Enter and execute the following query to find out which statistics exist for the dbo:OrderDetails table. The code for this example is included in the sample files as DataDistributionStatistics.sql. SELECT NAME, COL_NAME ( s.object_id , column_id ) as CNAME FROM sys.stats s INNER JOIN sys.stats_columns sc ON s.stats_id = sc.stats_id AND s.object_id = sc.object_id WHERE s.object_id = OBJECT_ID('dbo.OrderDetails') ORDER BY NAME; The result tells us the names of the statistics, and the columns for which they are calculated. Statistics for indexed columns are named like the corresponding index. The statistics with names starting with _WA_Sys_ are statistics SQL Server created automatically for columns without indexes.
3.	To get statistics information from a statistic the DBCC SHOW_STATISTICS statement can be used. Enter and execute the following statement to show the statistics for the LineTotal column of the dbo.OrderDetails table. DBCC SHOW_STATISTICS('dbo.OrderDetails', 'NCL_OrderDetail_LineTotal') The results pane shown below displays part of the output of the DBCCSHOW_STATISTICS statement. The first part displays general information, such as creation date, the number of rows in the table, and how many of the rows are sampled. Data density information is also displayed. Density is a value that shows how distinct values in the column are. Besides this general information SQL Server defines ranges within the data called steps and stores distribution statistics for these steps. With these distribution statistics SQL Server can estimate the number of rows affected when searching for a specific value by using the statistics for the range the specific values belong to. For every step, SQL Server stores the following information: RANGE_HI_KEY is the upper boundary value of the step. EQ_ROWS is the number of rows that are equal to the RANGE_HI_KEY. RANGE_ROWS is the number of rows in the range, without the boundaries. DISTINCT_RANGE_ROWS is the number of distinct values within the range. AVG_RANGE_ROWS is the average number of duplicate values within the range.
4.	Close SQL Server Management Studio.

Index Fragmentation

When a clustered index is created, the data is stored in the leaf level in a sorted order. In nonclustered indexes, the index keys are sorted and stored in the leaf level. To get the best performance from the leaf level, pages of indexes are ordered not only logically, but also physically. SQL Server stores these index pages in the appropriate physical order to be able to read the data as fast as possible from the disk. The most expensive part of reading data from disks is moving the arm of the disk. When the data pages to be read are ordered sequentially, the amount of arm movement needed is minimized, thus optimizing read performance.

When data is inserted into a table, it is stored at a specific page of a clustered index leaf-level page. The index keys of the nonclustered index also have to be inserted onto the right pages of the nonclustered index leaf-level pages. If this page is full, SQL Server has to perform a page split, which means allocating a new page and linking the new page to the appropriate index. This situation leads to index fragmentation, which means that the logical order of the data pages no longer matches the physical order. Index fragmentation can also happen as a result of UPDATE and DELETE statements.

To reduce fragmentation, the FILLFACTOR option can be used at index creation, which defines to what percentage leaf-level pages of indexes should be filled when the indexes are created. With a lower FILLFACTOR, fragmentation is less likely to occur because more entries can be made into a leaf-level page without the page splitting. On the other hand, a lower FILLFACTOR will yield a larger index because less data is initially stored on each leaf-level page.

When the table being indexed is not a read-only table, its indexes will fragment sooner or later. Fragmented indexes can be defragmented to improve the speed of data access by using the ALTER INDEX statement. There are two options for defragmentation:

REORGANIZE Reorganizing an index means that the leaf-level pages are sorted using a bubble sort operation. REORGANIZE sorts only the data pages, not the entries within the page, which means that a FILLFACTOR cannot be used while reorganizing.
REBUILD Rebuilding an index means that the whole index is rebuilt. This takes longer than reorganizing the index, but has better results. A FILLFACTOR option can be supplied to fill the pages to the desired percentage again. If no FILLFACTOR option is supplied, leaf-level pages are filled completely. The ONLINE option can also be supplied when rebuilding indexes. When the option is not supplied, the rebuild is performed offline, which means that the table is locked throughout the rebuild. Offline rebuilding is faster than online rebuilding, but because it locks the data, it cannot be used during times when data access is required.

Maintaining Indexes

1.	Open SQL Server Management Studio. Open a New Query window and change the database context to AdventureWorks.
2.	To retrieve fragmentation information, use the rowset function sys.dm_db_physical_stats. Enter and execute the following statement to retrieve the list of indexes with a fragmentation over 50%. The code for this section is included in the sample files as IndexFragmentation.sql. SELECT object_name(i.object_id) as object_name ,i.name as IndexName ,ps.avg_fragmentation_in_percent ,avg_page_space_used_in_percent FROM sys.dm_db_index_physical_stats(db_id(), NULL, NULL, NULL , 'DETAILED') as ps INNER JOIN sys.indexes as i ON i.object_id = ps.object_id AND i.index_id = ps.index_id WHERE ps.avg_fragmentation_in_percent > 50 AND ps.index_id > 0 ORDER BY 1
3.	To perform an online rebuild of the PK_Employee_EmployeeID index, enter and execute the following statement. ALTER INDEX PK_Employee_EmployeeID ON HumanResources.Employee REBUILD WITH (ONLINE = ON)
4.	Close SQL Server Management Studio.

Note

Usually the process of index defragmentation will be automated. This can be done using a maintenance plan (see the SQL Server Books Online topic titled "How to: Create a Maintenance Plan") or through writing your own scripts, which can be scheduled using SQL Server Agent.