Physical Database Design, Analysis, and Tuning


Physical database design refers to specifying how on-disk data structures are stored and accessed. It plays an important role in maximizing performance and simplifying the management of databases. In general, physical database design involves deciding what kinds of indexes to use and where, how to partition the data, how to organize data and log files into filegroups, and so on.

Imagine that you are working with developers to optimize the physical database design of a database that contains 50 tables. One of the tables has a million rows and grows at a rate of approximately 50,000 rows a weekday. There are about 1,500 users entering data into the system and about 50 users generating and viewing reports most of the time. Let's say you have gone through the index analysis process, but performance is still not as expected. The following sections describe some new features in SQL Server 2005 that you can try to improve the physical database design.

Creating Indexes with Included Columns

One of the best recommendations for nonclustered indexes is to create a narrow and efficient index. For best performance, it is recommended that you avoid using too many columns in an index. A narrow index can accommodate more rows into an 8KB index page than a wide index, thereby reducing I/O, reducing storage requirements, and improving database caching.

On the other hand, database professionals often try to gain performance by creating a covering index. A covering index is a nonclustered index built upon all the columns required to satisfy a SQL query without going to the base table. However, the key size cannot exceed 900 bytes, and you can have up to 16 columns in an index.

SQL Server 2005 introduces the new concept of including columns in nonclustered indexes. This yields small, efficient, narrow keys while still providing the benefits of a covering index, and it removes the 900 bytes/16 columns key size restriction. When you create a nonclustered index by using the CREATE NONCLUSTERED INDEX DDL statement, you can use the INCLUDE clause to specify up to 1,023 non-key columns to be stored at the leaf level of the index. The included columns are not part of the index key, which keeps the key small and efficient. However, because they are available in the leaf pages of the index, you can avoid querying the base table when the included columns satisfy the query.

In a nutshell, including non-key columns helps keep the key size small and efficient, provides the benefit of a covering index to improve the performance, and removes the 900 bytes/16 columns key size restriction because the included columns are not considered part of the index key.
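As a minimal sketch of the syntax (the table and column names here are hypothetical, not part of the example that follows), an index with included columns is declared like this:

```sql
-- Key column: OrderDate; CustomerName and TotalDue are non-key columns
-- carried only at the leaf level of the index.
-- Table and column names are illustrative.
CREATE NONCLUSTERED INDEX idxOrders_OrderDate
ON dbo.Orders (OrderDate)
INCLUDE (CustomerName, TotalDue);
```

A query that filters on OrderDate and selects only these three columns can then be answered entirely from the index, without touching the base table.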

Note

Columns of data type text, ntext, and image are not allowed as non-key included columns.


Let's look at an example of how included non-key columns can be used to create efficient nonclustered indexes and to improve performance. The following statements create a sample table and insert some test data into the table:

SET NOCOUNT ON;
USE AdventureWorks;
GO
IF OBJECT_ID('dbo.tblTest') IS NOT NULL
   DROP TABLE dbo.tblTest;
GO
CREATE TABLE dbo.tblTest (C1 INT, C2 INT, C3 CHAR(255));
GO
DECLARE @counter INT;
SET @counter = 1;
WHILE @counter <= 99999
BEGIN
   INSERT INTO dbo.tblTest VALUES(@counter%500, @counter, @counter);
   SET @counter = @counter + 1;
END;

The table does not have any indexes yet. Next, you should query the table:

SET STATISTICS IO ON;
SELECT C1, C2, C3 FROM dbo.tblTest WHERE C1 = 2;
GO

After you run this query, you should notice that the number of logical reads is 3,449 and that the execution plan shows the table scan.

You can optimize the preceding query by creating a nonclustered index:

CREATE NONCLUSTERED INDEX idxtblTest ON dbo.tblTest(C1);
GO
SELECT index_type_desc, index_depth, index_level, page_count, record_count
FROM sys.dm_db_index_physical_stats (DB_ID(),
   OBJECT_ID('dbo.tblTest'), NULL, NULL, 'DETAILED') -- NULL index_id = all indexes
WHERE index_id > 1;
SELECT C1, C2, C3 FROM dbo.tblTest WHERE C1 = 2;
GO

These Transact-SQL (T-SQL) statements create a narrow nonclustered index on column C1. The first SELECT query here uses a dynamic management view (DMV) to find out the total number of pages used by the index. The index uses about 224 pages.

After you run the preceding query, you should notice that the number of logical reads comes down to 204 and that the execution plan shows the use of an index and avoids a table scan. The execution plan shows RID lookup, which is a bookmark lookup accompanied by a nested loop join.

You can further optimize the query by creating a covering index:

CREATE NONCLUSTERED INDEX idxtblTest ON dbo.tblTest(C1, C2, C3)
   WITH (DROP_EXISTING = ON);
GO
SELECT index_type_desc, index_depth, index_level, page_count, record_count
FROM sys.dm_db_index_physical_stats (DB_ID(),
   OBJECT_ID('dbo.tblTest'), NULL, NULL, 'DETAILED') -- NULL index_id = all indexes
WHERE index_id > 1;
SELECT C1, C2, C3 FROM dbo.tblTest WHERE C1 = 2;
GO

The index key now contains all the columns queried in the SELECT statement. Because the nonclustered index key now contains more columns, the index will use a higher number of pages. The DMV SELECT query proves this, and you should notice that the index now uses about 3,581 pages.

The SELECT query on the table is optimized, and you should notice that the number of logical reads has now come down to 12 and that the execution plan shows the index seek as 100% of the cost.

Here's what happens when you use non-key included columns instead of a covering index:

CREATE NONCLUSTERED INDEX idxtblTest ON dbo.tblTest(C1)
   INCLUDE (C2, C3)
   WITH (DROP_EXISTING = ON);
GO
SELECT index_type_desc, index_depth, index_level, page_count, record_count
FROM sys.dm_db_index_physical_stats (DB_ID(),
   OBJECT_ID('dbo.tblTest'), NULL, NULL, 'DETAILED') -- NULL index_id = all indexes
WHERE index_id > 1;
SELECT C1, C2, C3 FROM dbo.tblTest WHERE C1 = 2;
GO

Note that the index key contains only the C1 column; C2 and C3 are included non-key columns. The DMV SELECT statement indicates that the index is using 3,463 pages, and the SELECT query on the table shows 11 logical reads and shows the index seek as 100% of the cost. (Compare this with the 3,581 index pages and 12 logical reads in the covering index.) On top of this, the index key size is small and efficient, and you can also overcome the 900 bytes/16 columns index keys restriction by using non-key included columns.

The data for non-key columns resides in the base table data pages and is also duplicated in index leaf-level pages. Therefore, you should avoid including unnecessary columns as non-key index columns. Because the included column data is duplicated, an index with included columns consumes more disk space, especially if included columns are of the varchar(max), nvarchar(max), varbinary(max), or xml data type. Fewer index rows will fit on an index page, which might increase the I/O and decrease the database cache efficiency. Index maintenance may increase in terms of the time that it takes to perform modifications, inserts, updates, or deletions to the underlying table. You should do analysis and testing to determine whether the gains in query performance outweigh the effect on performance during data modification and the additional disk space requirements.

Creating Indexes on Computed Columns

A computed column's value is calculated from an expression that uses other columns in the same table. In SQL Server 2000, computed columns are always virtual columns, not physically stored in the table. To create an index on such a column, the column expression must be deterministic and precise. In other words, an index cannot be created on a computed column if the column expression uses a nondeterministic function such as USER_ID() or CHARINDEX() or if the column expression results in an imprecise value such as a floating-point number.

SQL Server 2005 introduces the ability to persist the computed column values. In some situations, this can improve performance because the computed column value is already available in the data pages and is not calculated at runtime. In addition, having persisted computed columns allows you to create indexes on columns that are imprecise. Here's an example of this:

SET NOCOUNT ON;
USE AdventureWorks;
GO
IF OBJECT_ID('dbo.tblTest') IS NOT NULL
   DROP TABLE dbo.tblTest;
GO
CREATE TABLE dbo.tblTest
   (C1 float, C2 float, C3 AS C1*C2 PERSISTED, C4 AS C1*C2);
GO
CREATE INDEX idxTest ON dbo.tblTest(C3);
GO
--CREATE INDEX idxTest2 ON dbo.tblTest(C4); -- fails: imprecise and not persisted
GO
SELECT is_persisted, * FROM sys.computed_columns
   WHERE [object_id] = OBJECT_ID('dbo.tblTest');
GO
SELECT * FROM sys.indexes WHERE name = 'idxTest';
GO

This script creates a sample table with two computed columns. Column C3 is a persisted computed column, whereas C4 is a virtual computed column. Creating an index on an imprecise computed column succeeds if it is persisted; uncommenting and trying to create an index on C4 will fail. You can use the sys.columns system catalog view to find out whether the column is computed, and then you can use the sys.computed_columns catalog view to check whether the computed column is persisted.
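If you are unsure whether a computed column qualifies for indexing, you can also query its determinism and precision directly with the COLUMNPROPERTY function. A sketch against the sample table above:

```sql
-- 1 = deterministic/precise, 0 = not, NULL = not applicable.
SELECT name,
       COLUMNPROPERTY(OBJECT_ID('dbo.tblTest'), name, 'IsDeterministic') AS is_deterministic,
       COLUMNPROPERTY(OBJECT_ID('dbo.tblTest'), name, 'IsPrecise') AS is_precise
FROM sys.computed_columns
WHERE [object_id] = OBJECT_ID('dbo.tblTest');
```

Here both C3 and C4 are deterministic but imprecise (float expressions), which is why only the persisted C3 can be indexed.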

Indexing XML Data

SQL Server 2005 introduces a new data type named xml that you can use to store XML documents inside a database. The data stored in xml type columns can be queried by using the XQuery syntax and xml type methods such as exist(), query(), and value(). The XML data is internally stored as binary large objects (BLOBs), and this internal binary representation of XML data cannot exceed 2GB. Without an index on such columns, these BLOBs are parsed and shredded at runtime to evaluate a query, which can have a significant adverse impact on the query performance.

You can use the CREATE XML INDEX DDL statement to create an index on an xml type column. Each table can have up to 249 XML indexes. Here is the T-SQL syntax for creating XML indexes:

CREATE [ PRIMARY ] XML INDEX index_name
    ON <object> ( xml_column_name )
    [ USING XML INDEX xml_index_name
        [ FOR { VALUE | PATH | PROPERTY } ] ]
    [ WITH ( <xml_index_option> [ ,...n ] ) ]
[ ; ]

<xml_index_option> ::=
{
    PAD_INDEX = { ON | OFF }
  | FILLFACTOR = fillfactor
  | SORT_IN_TEMPDB = { ON | OFF }
  | STATISTICS_NORECOMPUTE = { ON | OFF }
  | DROP_EXISTING = { ON | OFF }
  | ALLOW_ROW_LOCKS = { ON | OFF }
  | ALLOW_PAGE_LOCKS = { ON | OFF }
  | MAXDOP = max_degree_of_parallelism
}

The primary XML index is a shredded and persisted representation of the XML BLOBs in the xml data type column. Using a primary XML index avoids the need to shred the XML BLOBs at runtime. After a primary XML index is created and the shredded representation is available, query performance can be further improved by creating secondary XML indexes on XML tags, values, and paths. Each xml column in a table can have one primary XML index and multiple secondary XML indexes. A primary XML index is required before any secondary index can be created. There are three types of secondary XML indexes (PATH, VALUE, and PROPERTY), and they are discussed later in this section.

A primary XML index requires a clustered index to be present on the primary key of the table. If you ever need to modify the primary key clustered index, then all XML indexes on the user table must be dropped first. When a primary XML index is created, SQL Server internally creates a clustered index, a B+ tree, with the clustered key formed from the clustering key of the user table and an XML node identifier. For each XML BLOB in the column, the index creates several rows of data.

Let's assume that the following XML instance is stored into an xml type column:

<book subject="Security" ISBN="0735615882">
   <title>Writing Secure Code</title>
   <author>
      <firstName>Michael</firstName>
      <lastName>Howard</lastName>
   </author>
   <price currency="USD">39.99</price>
</book>

When a primary XML index is created on a column containing this XML document, SQL Server internally creates a clustered index that contains the columns and rows listed in Table 9.1.

Table 9.1. Primary XML Index Internal Representation

PK   XID       NID (TAG)       TID         VALUE           HID
4    1         1 (book)        Element     Null            #1
4    1.1       2 (ISBN)        Attribute   "0735615882"    #2#1
4    1.5       4 (subject)     Attribute   "Security"      #4#1
4    1.7       5 (title)       Element     Null            #5#1
4    1.7.1     8 (text)        Text node   "Writing..."    #8#5#1
4    1.9       6 (author)      Element     Null            #6#1
4    1.9.1     7 (firstName)   Element     Null            #7#6#1
4    1.9.1.1   8 (text)        Text node   "Michael"       #8#7#6#1


Table 9.1 contains partial columns and rows indicating how the primary XML index is stored internally. The first column, PK, is the value of the primary key column in the base table. The second column, XID, is an internal identifier generated for each node in the XML tree. These first two columns (PK, XID) together form a clustered index key for the primary XML index. The third column is the node ID (NID) and the XML tag name (TAG), the fourth column (TID) indicates the node type, the fifth column (VALUE) contains the node value, and the last column (HID) shows the hierarchical ID, which is an internal tokenized representation of the reversed path from a node to the root of the tree.

After the primary XML index is created, you can further optimize query performance by creating secondary XML indexes. There are three types of secondary XML indexes:

  • PATH index: Secondary indexes built on paths are useful for path queries such as /person/address/zip. A PATH secondary XML index consists of the HID, VALUE, PK, and XID columns of the primary XML index.

  • VALUE index: These secondary indexes are useful for queries where you know the node value but not the location of the node, such as //city[.="Dallas"] or /item/@*[.="NA"]. A VALUE secondary XML index consists of the VALUE, HID, PK, and XID columns of the primary XML index.

  • PROPERTY index: These secondary indexes are useful for "property extraction" scenarios, in which queries retrieve multiple, sometimes related, values from individual XML instances.
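The PATH variant is demonstrated later in this section; the VALUE and PROPERTY variants are created with the same USING XML INDEX clause. A sketch against the dbo.tblXMLTest table built in the example that follows (the secondary index names are hypothetical):

```sql
-- Assumes a primary XML index pxmlidxTest already exists on dbo.tblXMLTest(c2);
-- the secondary index names are illustrative.
CREATE XML INDEX sxmlidxValueTest ON dbo.tblXMLTest (c2)
   USING XML INDEX pxmlidxTest FOR VALUE;
CREATE XML INDEX sxmlidxPropertyTest ON dbo.tblXMLTest (c2)
   USING XML INDEX pxmlidxTest FOR PROPERTY;
```

Choose the variant that matches the dominant query shape; each secondary index adds storage and maintenance cost of its own.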

The following example shows how XML indexes can be used to optimize the queries on xml type columns. You need to first create a sample table and insert some test XML data into the table:

SET NOCOUNT ON;
USE AdventureWorks;
GO
IF OBJECT_ID('dbo.tblXMLTest') IS NOT NULL
   DROP TABLE dbo.tblXMLTest;
GO
CREATE TABLE dbo.tblXMLTest
   (c1 int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    c2 xml NOT NULL);
GO
DECLARE @counter INT;
SET @counter = 1;
-- Each book element carries an id attribute; the queries below filter on /book/@id.
WHILE @counter < 1000
BEGIN
   INSERT INTO dbo.tblXMLTest (c2)
      VALUES ('<book id="1"><category>Security</category></book>');
   INSERT INTO dbo.tblXMLTest (c2)
      VALUES ('<book id="2"><category>Architecture</category></book>');
   INSERT INTO dbo.tblXMLTest (c2)
      VALUES ('<book id="3"><category>Design Patterns</category></book>');
   INSERT INTO dbo.tblXMLTest (c2)
      VALUES ('<book id="4"><category>Certification</category></book>');
   SET @counter = @counter + 1;
END;
GO

The table does not yet have any index on the xml column. You can click the "Include Actual Execution Plan" toolbar button in Management Studio, run the following query, and study the execution plan:

SELECT * FROM dbo.tblXMLTest
   WHERE c2.exist('/book/@id[. = "3"]') = 1;
GO

SQL Server parses and shreds the XML BLOB data for each row and runs the XPath expression to filter rows where the id attribute value is 3. The execution plan shows the use of the "Table Valued Function XML Reader with XPath Filter" operator, and the estimated subtree cost for the SELECT operator (approximately 4,012) is quite high.

Next, you should create a primary XML index to create a persisted, shredded representation of the XML data and then execute the same SELECT statement again:

CREATE PRIMARY XML INDEX pxmlidxTest ON dbo.tblXMLTest (c2);
GO
SELECT * FROM dbo.tblXMLTest
   WHERE c2.exist('/book/@id[. = "3"]') = 1;
GO

Now, the execution plan no longer contains the XML Reader operator, which proves that XML parsing and shredding is not required. Instead of the XML Reader operator, the execution plan now shows a clustered index scan using the primary XML index, and the estimated subtree cost for the SELECT operator (approximately 0.19) is significantly reduced.

You can further optimize the query by creating a PATH secondary index:

CREATE XML INDEX sxmlidxPathTest ON dbo.tblXMLTest (c2)
   USING XML INDEX pxmlidxTest FOR PATH;
SELECT * FROM dbo.tblXMLTest
   WHERE c2.exist('/book/@id[. = "3"]') = 1;
GO

You should notice that the PATH secondary XML index results in a further simplified execution plan with a clustered index seek, and the estimated subtree cost for the SELECT operator (approximately 0.06) is further reduced.

Figure 9.1 shows the preceding three execution plans.

Figure 9.1. XML indexes avoid shredding and parsing XML BLOBs, and the PATH secondary XML index further optimizes the query.


You can find more details about XML support in SQL Server 2005 in Chapter 10, "XML and Web Services Support in SQL Server 2005."

Optimizing Physical Design Structures by Using Database Engine Tuning Advisor

As a database's workload, data, and applications change over time, the existing indexes may no longer be appropriate, and new indexes might be required. To help with this process, SQL Server 2000 provided a tool called Index Tuning Wizard (ITW) that could be used to identify an optimal set of indexes for a given workload. The workload could be a SQL trace saved to a trace file (.trc), a SQL trace saved to a trace table, or a T-SQL script file (.sql). ITW accepted the workload as input, used the query processor to determine the most effective indexes for the workload by simulating index configurations, and made recommendations on adding and deleting indexes. These recommendations could be saved in a .sql script file to be analyzed and executed later.

SQL Server 2000 ITW had the following limitations:

  • It could tune only one database during a tuning session.

  • The wizard-based interface was not well suited to an iterative tuning and analysis process.

  • ITW recommendations were based solely on performance, not on improving manageability. Therefore, ITW did not make recommendations on how tables and indexes should be partitioned.

  • ITW did not provide the ability to limit the tuning time.

  • ITW did not scale well with very large databases and workloads. In addition, ITW could not tune workloads that used T-SQL constructs, such as table-valued functions, temporary tables, or triggers.

  • ITW allowed a developer to apply the recommendations immediately or at a later time, but it did not allow a developer to evaluate recommendations to perform "what-if" analysis.

  • Only members of the sysadmin fixed server role could use the ITW to tune databases.

SQL Server 2005 introduces a new tool called Database Engine Tuning Advisor (DTA) that fixes the problems with ITW and offers recommendations on various physical design structures, including indexes, indexed views, and partitioning. Unlike ITW, which provided a wizard-based interface, DTA is a full-fledged application that provides a session-based analysis and tuning approach. DTA allows for the tuning of multiple databases per session and is designed to scale well with very large databases and workloads. You can find more details on DTA in Chapter 5, "SQL Server 2005 Tools and Utilities."
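In addition to the graphical application, DTA ships with a command-line version, dta.exe, which is convenient for scripted, repeatable tuning runs. A hedged sketch of a typical invocation follows; the server, workload file, and session names are hypothetical, and you should verify the flags against the dta documentation for your build:

```shell
REM Tune the AdventureWorks database using a T-SQL workload script,
REM with Windows authentication (-E), a 30-minute limit (-A 30),
REM and the recommendations saved to a script file (-of).
dta -S MyServer -E -D AdventureWorks -if workload.sql -s TuningSession1 -A 30 -of recommendations.sql
```

Because each run is named by its session (-s), you can keep and compare multiple tuning sessions against the same workload.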

In addition to the physical database design enhancements discussed so far, SQL Server 2005 offers improvements in the way indexed views and partitioning work. The SQL Server 2005 query optimizer can match more queries to indexed views than in previous versions, including queries that contain scalar expressions, scalar aggregates, user-defined functions, and so on. The table and index partitioning feature is enhanced in SQL Server 2005 to simplify the administration, design, and development of a partitioned data set. A developer can optimize query performance by aligning a table with its associated indexes and by aligning related tables. Alignment refers to SQL Server's ability to group rows that have the same partitioning key. With partitioned tables in SQL Server 2005, a table can be designed (using a partition function and a partition scheme) such that all rows that have the same partitioning key are placed directly in (and will always go to) a specific location. When multiple tables use the same function (but not necessarily the same scheme), rows that have the same partitioning key are grouped similarly.
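The function/scheme split described above can be sketched as follows; the object names, boundary values, and filegroup mapping are hypothetical:

```sql
-- Partition function: maps an int partitioning key (an order year) to ranges.
CREATE PARTITION FUNCTION pfOrderYear (int)
   AS RANGE RIGHT FOR VALUES (2004, 2005);
GO
-- Partition scheme: maps each range to a filegroup (all to PRIMARY here;
-- a production design would typically spread ranges across filegroups).
CREATE PARTITION SCHEME psOrderYear
   AS PARTITION pfOrderYear ALL TO ([PRIMARY]);
GO
-- Two tables created on the same function/scheme are aligned on OrderYear:
-- rows with the same partitioning key land in the same partition.
CREATE TABLE dbo.Orders
   (OrderID int NOT NULL, OrderYear int NOT NULL)
   ON psOrderYear (OrderYear);
CREATE TABLE dbo.OrderDetails
   (OrderID int NOT NULL, OrderYear int NOT NULL, ProductID int NOT NULL)
   ON psOrderYear (OrderYear);
GO
```

Because both tables share pfOrderYear, a join on OrderYear can be processed partition by partition, which is the alignment benefit the paragraph above describes.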




Microsoft SQL Server 2005: Changing the Paradigm (SQL Server 2005 Public Beta Edition)
ISBN: 0672327783
Year: 2005
Pages: 150