Statistics and Histograms

We've noted that cost-based optimizers need volatile data in the system catalog. That data is called statistics (no, we don't know why it's called statistics). The DBMS has utilities and/or non-standard SQL statements for keeping statistics up to date, but they're slowupdating statistics for a table can take as long as creating an index for the table. Therefore, the updating is never automatically performed by the DBMS. Someone (usually the DBA) has to issue an "update statistics" instruction when it becomes clear that statistics are out of date. Careful administrators believe that statistics updating should happen after about 10% of the rows have been changed due to INSERT, UPDATE, or DELETE statements, or after any CREATE INDEX statement, and they will schedule automatic "update statistics" jobs accordingly .

The DBMSs that sport cost-based optimizers keep most or all of the following statistics:

The number of data pages.
The number of index pages.
The number of index levels.
The number of leaf pages in each index.
The number of distinct values in the first column of a compound index. There is also a statistic for the number of distinct values in the first two columns of a compound index, the number of distinct values in the first three columns of a compound index, and so on.
The number of pages that are not empty.
The number of rows that have been moved from their original page to other (overflow) pages.
The degree of clustering of an index, that is, the extent to which the physical sequence of rows in a table actually follows the order of an index.
The number of rows in a table.
The number of columns in a table.
The average length of a column's data.
The number of NULLs in a column.
The percentage of distinct values in a column. This is called "selectivity" and usually is available for indexed columns only.
The number of occurrences of frequently used column values.
The second-smallest and second-largest values for a column. Experience shows that MIN and MAX values are often far outside the regular distribution, hence the choice of second-smallest and second-largest to distinguish a column's normal range of values.

All these statistics are easy to calculate and take very little storage space in the system catalog. So what's tough? Histograms.

A histogram is detailed information on the distribution of values over a column. For example, suppose you have the following table:

 CREATE TABLE Table1 (    column1 VARCHAR(15),    column2 INTEGER)

Both columns of Table1 are indexed. A subset of the values contained in Table1.column1 is shown in Table 17-2.

The classic histogram is an ordered list of values with the number of times each value occurs. Table 17-3 shows the histogram for Table1.column1 .

It's possible to get a classic histogram with this type of query:

 SELECT column1, COUNT(*)   FROM Table1   GROUP BY column1

But the result is large and one objective is to store all statistics in RAM. So the DBMS uses a compression method instead. One compression method stores singletons (values with only one occurrence) separately, or not at all. Another compression method takes samples of every nth row. This is easy if there's an index; the DBMS just reads the top level of the B-tree.

Once the histogram is in place, if someone executes this SQL statement:

 SELECT * FROM Table1   WHERE column1 = 'Turkmenistan'

the DBMS can do a quick in-RAM binary search of the histogram for column1 during the optimization phase and find out how selective the search condition is. Usually this is more than enough information to form an access plan. Typically there will be histograms for all the important columns (which are often the same as the indexed columns). For other columns, the optimizer will depend on the easier-to-get statistics.

Table 17-2. Table1's Data

column1
Uzbekistan
Tajikistan
Afghanistan
Pakistan
Baluchistan
Turkmenistan
Kirghizstan
Khalistan
Pushtunistan
Waziristan
Afghanistan
Turkmenistan
Uzbekistan
Afghanistan
Khalistan
Afghanistan

Table 17-3. Histogram for `Table1.column1`

column1 Value	Number of Occurrences
Afghanistan	4
Baluchistan	1
Khalistan	2
Kirghizstan	1
Pakistan	1
Pushtunistan	1
Tajikistan	1
Turkmenistan	2
Uzbekistan	2
Waziristan	1

Once the optimizer has looked up the statistics, it can plug them into formulas. There is one formula for each ANDed or ORed search condition in a WHERE clause. First the optimizer uses a formula to calculate the cost of each step. Then the optimizer uses a formula to calculate the output from each step, so that if there are two steps, the optimizer will know what the input size and order are for the second step. As an example, consider this query:

 SELECT * FROM Table1   WHERE column1 = 'Waziristan'     AND column2 = 55

Assume that Table1 contains 100 rows and the value 55 appears 60 times in the index for column2 . With the statistics available to it, the optimizer can determine that column1 is indexed and can also determine (from the column1 histogram) that approximately one row will be returned. It can also determine that column2 is indexed and that column2 's histogram says that the value 55 occurs about 60 times. Thus the optimizer has a choice of three different access plans:

Plan #1

Lookup column1 in its index.

Lookup column2 in its index.

Merge results of the lookups.
Plan #2

Lookup column1 in its index.

For each result, read column2 in the table and filter out if the value isn't 55 .
Plan #3

Lookup column2 in its index.

For each result, read column1 in the table and filter out if the value isn't Waziristan .

Clearly, Plan #2 has the smallest cost, so the optimizer throws the other plans away. They will not be available at execution time.

This is a trivial example. It should be emphasized that the DBMS's biggest advantage from a cost-based optimizer becomes evident only when an SQL statement uses a join. Because there are possible plans for each join step, and because the number of possible plans can rise exponentially (for example, a four-way join has 4! [four factorial] plans to choose from), a WHERE clause with a multi-way join and a few semi-joins (subqueries) is too hard for a human. Your job is to optimize everything else, ensure the DBMS has what it needs, and let the optimizer do its job.

Analyzing Statistics

Analyzing the DBMS statistics can tell you when reorganization is necessary. Here are some things to look for.

Degree of clustering. In general, only one of the indexes in a table can have a high degree of clustering. If your DBMS collects cluster ratio statistics, check them occasionally. A low cluster ratio leads to more I/O, since after the first access of each data page, it is less likely that the page is still in the buffer pool the next time it is accessed. (The cluster ratio shows the percentage of data rows that are in the same page order as the index keys.)
Row overflow. This number shows how many rows are no longer present on their original page because of an expanding or shrinking update. In such cases, a pointer is kept at the row's original location. This can hurt performance, because the DBMS must follow the pointer to find the row's contents. Processing time is increased; more I/Os may be needed to get the data. As the number of overflow rows increases, the potential benefit of reorganizing your table also increases .
Empty versus total pages. Compare the figure for the number of pages with rows to the total number of pages that a table contains. Empty pages occur when entire ranges of rows are deleted. Empty pages will still be read for a table scan. As the number of empty pages grows higher, so does the need for a table reorganization.
Number of index leaf pages. This number predicts how many index page I/Os are needed for a complete scan of the index. Recall that random data changes can cause leaf page splits , which increase the size of an index beyond the minimum amount of space required. Rebuild your indexes to solve this problem. Don't forget to use PCTFREE/FILLFACTOR so that your DBMS will leave room in the rebuilt index for future data changes.

Ideally, you should rebind application programs after running statistics, because the optimizer may choose a different access plan for your queries, given the new statistics.

Multi-Column Statistics

Histograms are for columns, not combinations of columns. Even if you have a multi-column compound index, the DBMS will prefer to calculate full histograms for the first column only.

Table 17-2. Table1's Data

Table 17-3. Histogram for Table1.column1

Analyzing Statistics

Multi-Column Statistics

Table 17-3. Histogram for `Table1.column1`