Statistics


You've probably heard the term statistics bandied about in discussions of SQL Server query performance. Statistics are metadata that SQL Server maintains about index keys and, optionally, nonindexed column values. SQL Server uses statistics to determine whether using an index could speed up a query. In conjunction with indexes, statistics are the single most important source of data for helping the optimizer develop optimum execution plans. When statistics are missing or out-of-date, the optimizer's ability to formulate the best execution plan for a query is seriously impaired.

Let's cover a few basic statistics-related terms before we discuss statistics in more depth.

Cardinality

The cardinality of data refers to how many unique values exist in the data. In strict relational database theory, duplicate rows (tuples) are not permitted within a relation (a table), so cardinality would refer to the total number of tuples. That said, SQL Server does permit duplicate rows to exist in a table, so for our purposes the term cardinality refers to the number of unique values within a data set.

Density

Density refers to the uniqueness of values within a data set. An index's density is computed by dividing the number of rows that would correspond to a given key value by the number of rows in the table. For a unique index, this amounts to dividing 1 by the table's total row count. Density values range from 0 through 1; lower densities are better.
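To make the arithmetic concrete, you can approximate a key's density yourself. The query below is only a sketch using hypothetical object names (Orders and CustomerID); DBCC SHOW_STATISTICS reports the density values SQL Server actually uses:

 -- Approximate density: 1 divided by the number of unique key values
 -- (Orders and CustomerID are hypothetical names)
 SELECT 1.0 / COUNT(DISTINCT CustomerID) AS ApproxDensity
 FROM Orders

For a unique index on a 10,000-row table, this works out to 1 / 10,000 = 0.0001, an extremely low (and therefore desirable) density.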

Selectivity

Selectivity is a measure of the number of rows that will be returned by a particular query criterion. It expresses a relationship between your query criteria and the key values in an index. It is computed by dividing the number of keys being requested by the number of rows they access. Query criteria (usually specified in a WHERE clause) that are highly selective are the most useful to the optimizer because they allow it to predict with certainty how much I/O is required to satisfy a query.

Performance Issues

Indexes with high densities will likely be ignored by the optimizer. The most useful indexes to the optimizer have density values of 0.10 or less. Let's take the example of a table called VoterRegistration with 10,000 rows, no clustered index, and a nonclustered index on its PartyAffiliation column. If there are three political parties registered in the voting precinct and they each have about the same representation across the voter base, PartyAffiliation will likely contain only three unique values. This means that a given key value in the index could identify as many as 3,333 rows in the table, perhaps more. This gives the index a density of 0.33 (3,333 ÷ 10,000), and virtually ensures that the optimizer will not use the index when formulating an execution plan for queries that require columns not covered by the index.

To understand this better, let's compare the cost of using the index versus not using it to satisfy a simple query. If we wanted to list all the voters in the precinct affiliated with the Democratic Party, we'd be talking about hitting approximately a third of the table, or 3,333 rows. If we use the PartyAffiliation index to access these rows, we're faced with 3,333 separate logical page reads from the underlying table. In other words, as we found each key value in the index, we'd have to look up its bookmark in the underlying table to get the columns not contained in the index, and each time we did this, we'd incur the overhead of a logical (and possibly physical) page I/O. All told, we might incur as much as 26MB of page I/O overhead to look up these bookmark values (3,333 keys × 8K/page). Now consider the cost of simply scanning the table sequentially. If an average of 50 rows fits on each data page and we have to read the entire table to find all the ones affiliated with the Democratic Party, we're still only looking at about 200 logical page I/Os (10,000 rows ÷ 50 rows/page = 200 pages). This is a big difference and is the chief reason you'll see nonclustered indexes ignored in favor of table/clustered index scans.
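You can watch this tradeoff directly with SET STATISTICS IO. The sketch below assumes the VoterRegistration table from the example; the index name and the party value are assumptions. Forcing the nonclustered index with a table hint makes the bookmark-lookup cost visible in the logical reads count:

 SET STATISTICS IO ON
 -- The optimizer's likely choice: a scan (roughly 200 logical reads here)
 SELECT * FROM VoterRegistration WHERE PartyAffiliation = 'Democrat'
 -- Force the nonclustered index to expose the bookmark-lookup cost
 SELECT * FROM VoterRegistration (INDEX (PartyAffiliation))
 WHERE PartyAffiliation = 'Democrat'
 SET STATISTICS IO OFF

Comparing the two logical reads figures shows why the optimizer prefers the scan.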

At what point does a nonclustered index become sufficiently selective to be useful to the optimizer? In our example, 200 is the magic number. Specifically, the optimizer would have to believe that retrieving data via the index would require fewer than 200 page I/Os in order for it to consider the index a more efficient access path than simply scanning the entire table. The original 3,333 estimate could be lowered by adding columns to the index (and also to the query) that make it more selective. There is a point of diminishing returns here, though. As you add columns to the index in an attempt to make it more selective, you increase the amount of overhead that comes with traversing the index's B-tree. By making the index larger, you also make it more expensive to navigate. At some point, it becomes cheaper just to scan the data itself than to incur the overhead of walking through the B-tree.

Storage

SQL Server stores statistics for an index key or a column in the statblob column of sysindexes. Statblob is an image data type that stores a histogram containing a sampling of the values in the index key or column. For composite indexes, only the first column is sampled, but density values are maintained for the other columns.

During the index selection phase of query optimization, the optimizer decides whether an index matches up with the columns in the filter criteria, determines index selectivity as it relates to that criteria, and estimates the cost of accessing the data the query seeks.

If an index has only one column, its statistics consist of one histogram and one density value. If an index has multiple columns, a single histogram is maintained, as well as density values for each prefix (left-to-right) combination of key columns. The optimizer uses this combination of an index's histogram and densities (its statistics) to determine how useful the index is in resolving a particular query.
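You can inspect the histogram and the per-prefix density values with DBCC SHOW_STATISTICS. For example, the aunmind index in the pubs sample database covers au_lname and au_fname, so its output includes a density for au_lname alone and one for the au_lname, au_fname combination:

 USE pubs
 DBCC SHOW_STATISTICS (authors, aunmind)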

The fact that a histogram is only stored for the first column of a composite index is one of the reasons you should position the most selective columns first in a multicolumn index: The histogram will be more useful to the optimizer. Moreover, this is also the reason that splitting up composite indexes into multiple single-column indexes is sometimes advisable. Because the server can intersect and join multiple indexes on a single table, you retain the benefits of having the columns indexed, and you get the added benefit of having a histogram for each column (column statistics can help out here, as well). This isn't a blanket statement. Don't run out and drop all your composite indexes. Just keep in mind that breaking down composite indexes is sometimes a viable performance tuning option.
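As a sketch of the single-column alternative (all object names here are hypothetical), two separate indexes give the optimizer a histogram for each column, and a query that filters on both columns can still use an index intersection:

 CREATE INDEX IX_Voter_Party ON VoterRegistration (PartyAffiliation)
 CREATE INDEX IX_Voter_Precinct ON VoterRegistration (Precinct)
 -- The optimizer can intersect both indexes to resolve this filter
 SELECT VoterID
 FROM VoterRegistration
 WHERE PartyAffiliation = 'Democrat' AND Precinct = 42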

Column Statistics

Besides index statistics, SQL Server can also create statistics on nonindexed columns (which happens automatically when you query a nonindexed column while AUTO_CREATE_STATISTICS is enabled for the database). Being able to determine the likelihood that a given value may occur in a column gives the optimizer valuable information in determining how best to service a query. It allows the optimizer to estimate the number of rows that will qualify from a given table involved in a join, allowing it to select join order more accurately. Also, the optimizer can use column statistics to provide histogram-type information for the other columns in a multicolumn index. Basically, the more information you can give the optimizer about your data, the better.
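Enabling the option and inspecting the results might look like this (the database name is hypothetical; sp_helpstats lists both manual and auto-created statistics, the latter named with a _WA_Sys_ prefix):

 ALTER DATABASE VoterDB SET AUTO_CREATE_STATISTICS ON
 -- After querying a nonindexed column, list the statistics that were created
 EXEC sp_helpstats 'VoterRegistration', 'ALL'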

Listing Statistics

SQL Server uses statistics to track the distribution of key values across a table. The histogram that's stored as part of an index's statistics contains a sampling of as many as 200 values for the index's first key column. Besides the histogram, statblob also contains:

  • The number of rows on which the histogram and densities are based

  • The average length of the index key

  • The date and time of the last statistics generation

  • Density values for other prefix combinations of key columns

The range of key values between each of the 200 histogram sample values is called a step. Each sample value denotes the end of a step, and each step stores three values:

  1. EQ_ROWS: The number of rows with a key value matching the sample value

  2. RANGE_ROWS: The number of other values inside the range

  3. RANGE_DENSITY: A density computation for the range itself

DBCC SHOW_STATISTICS lists the EQ_ROWS and RANGE_ROWS values verbatim and uses RANGE_DENSITY to compute DISTINCT_RANGE_ROWS and AVG_RANGE_ROWS for the step. It computes DISTINCT_RANGE_ROWS (the total number of distinct rows within the step's range) by dividing 1 by RANGE_DENSITY, and computes AVG_RANGE_ROWS (the average number of rows per distinct key value) by multiplying RANGE_ROWS by RANGE_DENSITY.
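Here's a quick worked example, with the RANGE_ROWS and RANGE_DENSITY values assumed purely for illustration:

 -- Suppose a step has RANGE_ROWS = 100 and RANGE_DENSITY = 0.04
 SELECT 1 / 0.04   AS DISTINCT_RANGE_ROWS, -- 25 distinct values in the step
        100 * 0.04 AS AVG_RANGE_ROWS       -- 4 rows per distinct value

Note that the two results are consistent: 25 distinct values averaging 4 rows apiece accounts for the step's 100 range rows.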

Updating Statistics

Statistics can be updated in a couple of ways. The first and most obvious is through the AUTO_UPDATE_STATISTICS database option (you can turn this on via ALTER DATABASE or sp_dboption). When statistics are generated automatically, SQL Server uses sampling (as opposed to scanning the entire table) to speed up the process. This works in the vast majority of cases, but can sometimes lead to statistics that are less useful than they could be.
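Either statement enables automatic updating for a database (pubs is used as an example here):

 ALTER DATABASE pubs SET AUTO_UPDATE_STATISTICS ON
 -- The older, equivalent route:
 EXEC sp_dboption 'pubs', 'auto update statistics', 'true'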

Closely related to automatic statistics updating is automatic statistics creation. This occurs when the AUTO_CREATE_STATISTICS database option has been enabled and you issue a query that filters on a nonindexed column. SQL Server will automatically create a set of column statistics for you.

The second method of updating statistics is through the UPDATE STATISTICS command. UPDATE STATISTICS was the only way to update statistics prior to SQL Server 7.0. UPDATE STATISTICS can either use sampling, as happens with automatic updating, or it can do a full scan of the table, resulting in better statistics, but likely taking longer.
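For example, against the authors table in pubs:

 -- Sampled update: fast, reads only a portion of the table
 UPDATE STATISTICS authors WITH SAMPLE 50 PERCENT
 -- Full scan: reads every row for the most accurate statistics
 UPDATE STATISTICS authors WITH FULLSCAN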

Closely related to UPDATE STATISTICS is the CREATE STATISTICS command. You use CREATE STATISTICS to create column statistics manually. Once created, these statistics can be updated through automatic updating or via UPDATE STATISTICS, just as regular index statistics can.
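For instance, to build statistics on authors' nonindexed zip column (the statistics name st_au_zip is just an example):

 CREATE STATISTICS st_au_zip ON authors (zip) WITH FULLSCAN
 -- Later, refresh them just as you would index statistics
 UPDATE STATISTICS authors st_au_zip WITH FULLSCAN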

SQL Server provides a few stored procedures to make creating and updating statistics easier. Sp_updatestats runs UPDATE STATISTICS against all user-defined tables in the current database. Unlike the UPDATE STATISTICS command itself, though, sp_updatestats cannot issue a full scan of a table to build statistics; it always uses sampling. If you want full-scan statistics, you have to use UPDATE STATISTICS.

Sp_createstats is similarly handy. It can automate the creation of column statistics for all eligible columns in all eligible tables in a database. Eligible columns include noncomputed columns with data types other than text, ntext, or image that do not already have column or first-column index statistics. Eligible tables include all user (nonsystem) tables.
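Both procedures are simple to invoke; for example:

 -- Sampled statistics update for every user table in the current database
 EXEC sp_updatestats
 -- Create column statistics for all eligible columns, using a full scan
 EXEC sp_createstats @fullscan = 'fullscan'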

Sp_autostats allows you to control automatic statistics updating at the table and index levels. Rather than simply relying on the AUTO_UPDATE_STATISTICS database option, you can enable/disable automatic statistics generation at a more granular level. For example, if you run a nightly job on a large table to update its statistics using a full scan, you may want to disable automatic statistics updates for the table. Using sp_autostats, you can disable automatic statistics updates on this one table, and leave it enabled for the rest of the database. Statistics updates on large tables, even those that use sampling, can take a while to run and can use significant CPU and I/O resources.
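For example, assuming the large VoterRegistration table from earlier (the index name is hypothetical):

 -- Disable automatic statistics updates for this one table
 EXEC sp_autostats 'VoterRegistration', 'OFF'
 -- Or control the setting per index
 EXEC sp_autostats 'VoterRegistration', 'ON', 'IX_Voter_Party'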

Keep in mind that the negative impact on performance of not having statistics or having out-of-date statistics far outweighs the performance benefits of avoiding automatic statistics updates/creation. You should only disable auto update/create stats when thorough testing has shown that there's no other way to achieve the performance or scalability you require.

sp_showstatdate

Listing 17-8 shows a stored procedure that you can use to stay on top of statistics updates. It shows the statistics type, the last time it was updated, and a wealth of other information that you may find useful in managing index and column statistics. Here's the code:

Listing 17-8 The sp_showstatdate procedure helps stay on top of statistics updates.
 CREATE PROC sp_showstatdate @tabmask sysname='%', @indmask sysname='%'
 AS
 SELECT
   LEFT(CAST(USER_NAME(uid)+'.'+o.name AS sysname),30) AS TableName,
   LEFT(i.name,30) AS IndexName,
   CASE WHEN INDEXPROPERTY(o.id,i.name,'IsAutoStatistics')=1 THEN 'AutoStatistics'
        WHEN INDEXPROPERTY(o.id,i.name,'IsStatistics')=1 THEN 'Statistics'
        ELSE 'Index'
   END AS Type,
   STATS_DATE(o.id, i.indid) AS StatsUpdated,
   rowcnt,
   rowmodctr,
   ISNULL(CAST(rowmodctr/CAST(NULLIF(rowcnt,0) AS decimal(20,2))*100 AS int),0)
     AS PercentModifiedRows,
   CASE i.status & 0x1000000
     WHEN 0 THEN 'No'
     ELSE 'Yes'
   END AS [NoRecompute?],
   i.status
 FROM dbo.sysobjects o JOIN dbo.sysindexes i ON (o.id = i.id)
 WHERE o.name LIKE @tabmask
   AND i.name LIKE @indmask
   AND OBJECTPROPERTY(o.id,'IsUserTable')=1
   AND i.indid BETWEEN 1 AND 254
 ORDER BY TableName, IndexName
 GO
 USE pubs
 GO
 EXEC sp_showstatdate

(Results abridged)

 TableName         IndexName       Type       StatsUpdated
 ----------------- --------------- ---------- -----------------------
 dbo.authors       au_fname        Statistics 2000-07-02 19:42:04.487
 dbo.authors       aunmind         Index      2000-06-30 20:54:56.737
 dbo.authors       UPKCL_auidind   Index      2000-06-30 20:54:56.737
 dbo.dtproperties  pk_dtproperties Index      NULL
 dbo.employee      employee_ind    Index      2000-06-30 20:54:45.280
 dbo.employee      PK_emp_id       Index      2000-06-30 20:54:45.297


The Guru's Guide to SQL Server™ Stored Procedures, XML, and HTML
ISBN: 201700468
EAN: N/A
Year: 2005
Pages: 223
