As mentioned earlier, the selectivity of a key is an important factor that determines whether an index will be used for a query. SQL Server stores the selectivity and the histogram of sample values of the key in the statblob column on the sysindexes system table. Based on the values stored in this column for the index, and the SARGs specified for the query, the query optimizer decides which index to use. The statblob column is an image column. To see the statistical information stored in the statblob column, use the DBCC SHOW_STATISTICS command, which returns the following pieces of information:
The syntax for DBCC SHOW_STATISTICS is as follows : DBCC SHOW_STATISTICS ( tablename, index ) Listing 34.1 displays the statistical information for the aunmind nonclustered index on the au_lname and au_fname columns of the authors table. Listing 34.1 DBCC SHOW_STATISTICS Output for the aunmind Index on the authors Tabledbcc show_statistics (authors, aunmind ) go Statistics for INDEX 'aunmind'. Updated Rows Rows Sampled Steps Density Average key length -------------------- ------ ------------ ----- ---------- ------------------ Aug 6 2001 1:34AM 23 23 22 0.0 24.52174 All density Average Length Columns ------------------------ ------------------------ -------------------------- 4.5454547E-2 7.3913045 au_lname 4.3478262E-2 13.52174 au_lname, au_fname 4.3478262E-2 24.52174 au_lname, au_fname, au_id (3 row(s) affected) RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS ----------------- ------------ ------------ -------------------- -------------- Bennet 0.0 1.0 0 0.0 Blotchet-Halls 0.0 1.0 0 0.0 Carson 0.0 1.0 0 0.0 DeFrance 0.0 1.0 0 0.0 del Castillo 0.0 1.0 0 0.0 Dull 0.0 1.0 0 0.0 Green 0.0 1.0 0 0.0 Greene 0.0 1.0 0 0.0 Gringlesby 0.0 1.0 0 0.0 Hunter 0.0 1.0 0 0.0 Karsen 0.0 1.0 0 0.0 Locksley 0.0 1.0 0 0.0 MacFeather 0.0 1.0 0 0.0 McBadden 0.0 1.0 0 0.0 O'Leary 0.0 1.0 0 0.0 Panteley 0.0 1.0 0 0.0 Ringer 0.0 2.0 0 0.0 Smith 0.0 1.0 0 0.0 Straight 0.0 1.0 0 0.0 Stringer 0.0 1.0 0 0.0 White 0.0 1.0 0 0.0 Yokomoto 0.0 1.0 0 0.0 Looking at the output, you can determine that the statistics were last updated on Aug 6, 2001. The table at the time the statistics were generated had 23 rows, and all 23 rows were sampled to generate the statistics. The average key length is 24.52174 bytes. From the density information, you can see that this index is highly selective (a low density means high selectivityindex densitites will be covered shortly). Of the 23 rows in the table, 22 of them are unique values. After the general information and the index densities, the index histogram is displayed. The Statistics HistogramUp to 200 sample values can be stored in the statistics histogram. Each sample value is called a step. The sample value stored in each step is the endpoint of a range of values. Three values are stored for each step, described as follows:
In the output in Listing 34.1, all the distinct key values in the first column of the index are stored as the sample values stored in the histogram. Therefore, there are no data values between the sample values in the histogram, and subsequently all the range values are 0. You can see that there is a duplicate in the index key for the last name of Ringer ( EQ_ROWS = 2). For comparison purposes, Listing 34.2 shows a snippet of the DBCC SHOW_STATISTICS output for the sales table in bigpubs2000 . Listing 34.2 DBCC SHOW_STATISTICS Output for the titleidind Index on the sales Table in the bigpubs2000 DatabaseStatistics for INDEX 'titleidind'. Updated Rows Rows Sampled Steps Density Average key length -------------------- ------ ------------ ----- ------------ ------------------ Aug 21 2001 11:18PM 168725 168725 200 1.8955356E-3 26.405577 (1 row(s) affected) All density Average Length Columns ------------------------ ------------------------ ----------------------------- 1.8621974E-3 6.0 title_id 5.997505E-6 10.0 title_id, stor_id 5.9268041E-6 26.405577 title_id, stor_id, ord_num (3 row(s) affected) RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS ------------ ---------- ------- ------------------- ------------------- BI0194 0.0 314.0 0 0.0 BI2184 613.0 343.0 2 306.5 BI2574 270.0 277.0 1 270.0 BI3224 618.0 286.0 2 309.0 BI3976 311.0 293.0 1 311.0 BI6450 673.0 300.0 2 336.5 BI9506 947.0 292.0 3 315.66666 BU1111 296.0 299.0 1 296.0 BU7832 349.0 334.0 1 349.0 CH0249 1011.0 311.0 3 337.0 CH0639 984.0 307.0 3 328.0 ... TC4203 0.0 321.0 0 0.0 TC7777 0.0 297.0 0 0.0 (200 row(s) affected) As you can see in this example, there are a greater number of rows per range and a greater number of duplicates for each step value. Also, all 200 rows in the histogram are used, and the 168,725 rows in the table are distributed across those 200 rows. All 168,725 were sampled to generate the statistics. The histogram steps can be used for SARGs only when a constant expression is compared against an indexed column and the value of the constant expression is known at query compile time. Examples of SARGs where histogram steps can be used include
Some constant expressions cannot be evaluated until query runtime. These include search arguments containing local variables or subqueries and also join clauses such as
For these types of statements, you need some other way of estimating the number of matching rows. Additionally, because histogram steps are kept only on the first column of the index, SQL Server must use a different method for determining the number of matching rows for SARGs that specify multiple column values for a composite index, such as the following: select * from sales where title_id = 'BI3976' and stor_id = 'P648' When the histogram is not used, or cannot be used, SQL Server uses the index density values to estimate the number of matching rows. Index DensitiesSQL Server stores the density values of each column in the index for use in queries where the SARG value is not known until runtime, or when the SARG is on multiple columns of the index. For composite keys, SQL Server stores the density for the first column of the composite key; for the first and second columns; for the first, second, and third columns; and so on. This information can be seen in the All density section of the DBCC SHOW_STATISTICS output in Listing 34.1. Index density essentially represents the inverse of all unique key values of the key. The density of each key is calculated by the following formula:
Therefore, the density for the state column in the authors table in the pubs database is calculated as follows: Select Density = 1.00/ (select count(distinct state) from authors) Go Density ---------------- .1250000000000 The density for the combination of the columns state and zip is as follows: Select Density = 1.00/ (select count(distinct state+zip) from authors) Go Density ---------------- .0555555555555 Notice that, unlike the selectivity ratio, a smaller index density indicates a more selective index. As the density value approaches 1, the index becomes less selective and essentially useless. When the index selectivity is poor, the optimizer might choose to do a table scan, or a leaf level index scan, rather than perform an index seek, because it is more cost effective.
Typically, the density value should become smaller (that is, more selective) as you add more columns to the key. For example, in Listing 34.2, the densities get progressively smaller (and thus, more selective) as additional columns are factored in, as shown in Table 34.1. Table 34.1. Index Densities for titleidind Index on Sales Table
Estimating Rows Using the Index StatisticsSo how does the optimizer use the index statistics to determine the effectiveness of the index? SQL Server uses the histogram information when searching for an index key value across a range or when there are duplicate values in the key. Consider a query on the sales table in the bigpubs2000 database: select * from sales where title_id = 'BI2184' Because there are duplicates of title_id in the table, SQL Server uses the histogram on title_id (refer to Listing 34.2) to estimate the number of matching rows. For the value of BI2184 , it would look at the EQ_ROWS value, which is 343.0. This indicates that there are 343 rows in the table with a title_id value of BI2184 . When an exact match for the search argument is not found as a step in the histogram, SQL Server uses the AVG_RANGE_ROWS value for the next step greater than the search value. For example, SQL Server would estimate that for a search value of 'BI2187' , on average, it would match 270.0 rows. For a range retrieval, SQL Server sums up the RANGE_ROWS and EQ_ROWS values between the endpoints of the range retrieval. For example, using the histogram in Listing 34.2, if the search argument was where title_id <= 'BI2574' , the row estimate would be 314 + 613 + 343 + 270 + 277, or 1,817. When the histogram cannot be used, SQL Server uses just the index density to estimate the number of matching rows. The formula is straightforward for an equality search, such as declare @tid varchar(6) select @tid = 'BI2574' select count(*) from sales where title_id = @tid The row estimate is the index density ( 1.8621974E-3 ) for the specified key multiplied by the number of rows in the table: select count(*) * 1.8621974E-3 from sales go ------------------- 314.19925631500001 If a query specifies both the title_id and stor_id as SARGs, and if the SARG for title_id is a constant expression that can be evaluated at optimization time, SQL Server uses both the index density on title_id and stor_id and the histogram on title_id to estimate the number of matching rows (for some data values, the estimated number of matching rows for title_id and stor_id calculated using the index density could be greater than the estimated number of rows that match the specific title_id as determined by the histogram). SQL Server uses whichever is the smaller of the two to calculate the row estimate. Looking at the index density for title_id, stor_id , you can see that it is nearly unique: select count(*) * 5.997505E-6 from sales ----------------------------------------------------- 1.011929031125 In this example, SQL Server would use the index density on title_id and stor_id to estimate the number of matching rows. In this case, it is estimated that the query will return one matching row. Generating and Maintaining Index StatisticsNow, the questions you might ask are "How do the index statistics get created, and how are they maintained ?" The index statistics are first created when you create the index on a table or when you run the UPDATE STATISTICS command. In versions of SQL Server prior to version 7.0, index statistics information did not get updated automatically. If you inserted many rows after the index had already been created, the histogram information reflected by the index statistics did not reflect the actual key distribution. As a result, the optimizer sometimes chose an access plan that was inefficient. As part of regular maintenance, DBAs had to create a schedule for running the UPDATE STATISTICS command to keep the index statistics up-to-date with the table data. As of SQL Server 7.0, index statistics are automatically updated by SQL Server. SQL Server constantly monitors the update activity on the indexed key values and updates the statistics through an internal process when appropriate. Auto-Update StatisticsThe AutoStat process monitors the updates to a table to determine when it should run. A column in the sysindexes table called rowmodctr maintains a running total of the number of modifications made to a table that affect the index statistics. The value in this column is incremented under the following circumstances:
Whenever the index statistics have been updated for the table, rowmodctr is reset to 0. The threshold value for rowmodctr that triggers an AutoStat update is governed by the size and the nature of the table. The simple algorithm is as follows:
When AutoStats generates an update of the index statistics, it generates the new statistics based on a sampling of the data values in the table. Sampling helps to minimize the overhead of the AutoStats process. The sampling is random across the data pages, and the values are taken from the table or the smallest nonclustered index on the columns needed to generate the statistics. After a data page containing a sampled row has been read from disk, all the rows on the data page are used to update the statistical information.
To determine how often the AutoStats process is being run, you can turn on trace flag 8721, which generates output in the errorlog, similar to the following, when an update of the index statistics occurs as a result of rowmodctr reaching the threshold for the table: [View full width]
You can also use the SQL Profiler to determine when an automatic update of index statistics is occurring by monitoring the Auto Stats event in the Objects event class (for more information on using SQL Profiler, see Chapter 7). If necessary, it is possible to turn off the AutoStats behavior by using the sp_autostats system stored procedure. This stored procedure allows you to turn the automatic updating of statistics on or off for a specific index or all the indexes of a table. The following command turns off the automatic update of statistics for an index named aunmind on the authors table: Exec sp_autostats 'authors', 'OFF', 'aunmind' go Automatic statistics maintenance turned OFF for 1 indices. When you run sp_autostats and simply supply the table name, it displays the current setting for the table, as well as the database. Following are the settings for the authors table: Exec sp_autostats 'authors' Go Global statistics settings for [pubs]: Automatic update statistics: ON Automatic create statistics: ON Settings for table [authors] Index Name AUTOSTATS Last Updated -------------------------- --------- ------------------------- [UPKCL_auidind] ON 2001-09-04 23:25:24.217 [aunmind] OFF 2001-09-04 23:24:31.497 [_WA_Sys_state_75D7831F] ON 2001-09-04 23:24:31.497 [st_ind] ON 2001-09-04 23:24:31.497 There are three other ways to disable auto-updating of statistics for an index:
You can also turn AutoStats on or off for the entire database by setting the database option in Enterprise Manager (bring up the Properties dialog box and click on the Options tab and check or uncheck the Auto Update Statistics option). You can also disable or enable the AutoStats option for a database using the sp_dboption procedure: sp_dboption dbname, 'auto update statistics' [, { 'ON' 'OFF' } ]
Manually Updating StatisticsWhether you've disabled AutoStats or not, you can still manually update index statistics by using the UPDATE STATISTICS T-SQL command. In SQL Server 2000, this command was enhanced to support more features. The new syntax of this command is UPDATE STATISTICS { table } [ index ( statistics_name [, ... n ]) [WITH [FULLSCAN] SAMPLE number {PERCENT ROWS}] RESAMPLE ] ][[,] NORECOMPUTE][[,] [INDEX COLUMNS ALL] If neither the FULLSCAN nor SAMPLE options are specified, the default behavior is to perform a sample scan to calculate the statistics. SQL Server automatically computes the appropriate sample size. For tables less than 8MB in size, the sample will always be the entire table. The FULLSCAN option forces SQL Server to perform a full scan of the data in the table or index to calculate the statistics. This generates more accurate statistics than using sampling but is also the most time consuming and I/O intensive . When you use the SAMPLE option, you can specify a fixed number of rows or percentage of rows to sample to build or update the index statistics. If the sampling ratio specified ever results in too few rows being sampled, SQL Server automatically corrects the sampling based on the number of existing rows in the table. The RESAMPLE option specifies that the statistics be generated using the previously defined sampling ratio. This RESAMPLE option is useful when you have indexes or column statistics that were created with different sampling values. For example, if the index statistics were created using FULLSCAN , and the column statistics were created using a 50 percent sample, specifying the RESAMPLE option would update the statistics on the indexes using FULLSCAN and using a 50 percent sample for the others. As previously discussed, SQL Server 2000 automatically updates the index statistics by default. If you specify the NORECOMPUTE option with UPDATE STATISTICS , it disables AutoStats for the table or index. When the automatic update statistics option is turned off, you should run the UPDATE STATISTICS command periodically when appropriate. To determine the last time statistics were updated, run the following command: select STATS_DATE( tableid, indexid ) The following is an example: select STATS_DATE(object_id('authors'), 1) go ------------------------------------------------------ 2001-08-21 23:17:50.510
Creating StatisticsIn addition to statistics on indexes, SQL Server can also store statistics on individual columns that are not part of any indexes. Knowing the likelihood of a particular value being found in a nonindexed column can help the optimizer determine the optimal execution plan, whether or not SQL Server is using an index to actually locate the rows. For example, consider the following query: select stor_name from stores st join sales s on (st.stor_id = s.stor_id) where s.qty <= 100 SQL Server knows the density of the stor_id column in both the sales and stores tables because of indexes on the column in those tables. There is no index on qty . However, if the optimizer were to know how many rows in the sales table had qty less than 100, it would be better able to choose the most efficient query plan for joining between sales and stores . For example, assume that on average, there are approximately 500 sales per store. However, there are only approximately 5 sales per store where the qty is less than 100. With the statistics on qty , SQL Server has the opportunity to determine this and knowing there might be only 5 matching rows in sales versus 500, it might choose a different, more efficient, join strategy between the two tables. Being able to keep statistics on the qty column without having to add it to an existing index with stor_id , or create a separate index on qty , provides SQL Server with the selectivity information it needs for optimization. By not having to create an index on qty , you avoid incurring the overhead of having to maintain the index key rows for each insert, update, and delete that occurs on the table. Only the index statistics on qty need to be maintained, which is required only after many modifications to the data have occurred. By default, column statistics are generated automatically in SQL Server when queries are optimized. These statistics are created whenever no column statistics exist and the column is specified in a SARGs or join clause and the optimizer needs to estimate the approximate density or distribution of column values. This rule has two exceptions:
If you want to disable or re-enable the database option to auto-create statistics in the database, use the sp_dboption procedure: sp_dboption dbname, 'auto create statistics' [, { 'ON' 'OFF' } ] You can also turn the auto-create statistics option on or off for the entire database by setting the database option in Enterprise Manager. Bring up the Properties dialog box for the database. Click on the Options tab and check or uncheck the Auto create statistics option. Column statistics are stored in the sysindexes table with an ID between 2 and 254. Auto-generated statistics will have a name in the format "_WA_Sys_ colname _ systemgeneratednumber " . You can retrieve a list of auto-generated column statistics with a query similar to the following: select cast(object_name(id) as varchar(30)) as 'table', name, indid from sysindexes where indexproperty(id, name, 'IsAutoStatistics') = 1 go table name indid ------------------------------ --------------------------- ------ authors _WA_Sys_state_4AB81AF0 3 sales _WA_Sys_ord_num_628FA481 3 stores _WA_Sys_state_6477ECF3 3 stores _WA_Sys_zip_6477ECF3 4 titles _WA_Sys_type_6A30C649 3 If you want finer control over how the column statistics are generated, you can use the CREATE STATISTICS command. The syntax is similar to UPDATE STATISTICS with the exception that you specify a column or list of columns instead of an index to create statistics on: CREATE STATISTICS statistics_name ON table ( column [ ,... n ] ) [ WITH [ [ FULLSCAN SAMPLE number PERCENT ] [,] ] [ NORECOMPUTE] ] You cannot create statistics on computed columns or columns of the ntext , text , or image datatypes. If you want to create single-column statistics on all eligible columns in a database, you can use the sp_createstats system procedure: sp_createstats [[@indexonly =] 'indexonly'] [,[@fullscan =] 'fullscan'] [,[@norecompute =] 'norecompute'] The created statistics will have the same name as the column on which they are created. Statistics are not created on columns that already have statistics on them (for example, the first column of an index or a column that already has explicitly created statistics). To display a list of all column statistics, whether auto-generated or manually created, use a query similar to the previous one, but specify the 'IsStatistics' index option instead of 'IsAutoStatistics' : select cast(object_name(id) as varchar(30)) as 'table', name, indid from sysindexes where indexproperty(id, name, 'IsStatistics') = 1 go table name indid ------------------------------ ------------------------------ ------ authors _WA_Sys_state_4AB81AF0 3 sales _WA_Sys_ord_num_628FA481 3 sales ord_date 4 sales qty 5 sales payterms 6 stores _WA_Sys_state_6477ECF3 3 stores _WA_Sys_zip_6477ECF3 4 stores stor_name 5 stores stor_address 6 titles _WA_Sys_type_6A30C649 3
To remove a collection of statistics on one or more columns for a table in the current database, use the DROP STATISTICS command: DROP STATISTICS table.statistics_name Be aware that dropping the column statistics could affect how your queries will be optimized, and less efficient query plans might be chosen . Also, if the auto-create statistics option is enabled for the database, SQL Server will likely automatically create statistics on the columns the next time they are referenced in a SARG or join clause for a query. |