We've noted that cost-based optimizers need volatile data in the system catalog. That data is called statistics (no, we don't know why it's called statistics). The DBMS has utilities and/or non-standard SQL statements for keeping statistics up to date, but they're slowupdating statistics for a table can take as long as creating an index for the table. Therefore, the updating is never automatically performed by the DBMS. Someone (usually the DBA) has to issue an "update statistics" instruction when it becomes clear that statistics are out of date. Careful administrators believe that statistics updating should happen after about 10% of the rows have been changed due to INSERT, UPDATE, or DELETE statements, or after any CREATE INDEX statement, and they will schedule automatic "update statistics" jobs accordingly . The DBMSs that sport cost-based optimizers keep most or all of the following statistics:
All these statistics are easy to calculate and take very little storage space in the system catalog. So what's tough? Histograms. A histogram is detailed information on the distribution of values over a column. For example, suppose you have the following table: CREATE TABLE Table1 ( column1 VARCHAR(15), column2 INTEGER) Both columns of Table1 are indexed. A subset of the values contained in Table1.column1 is shown in Table 17-2. The classic histogram is an ordered list of values with the number of times each value occurs. Table 17-3 shows the histogram for Table1.column1 . It's possible to get a classic histogram with this type of query: SELECT column1, COUNT(*) FROM Table1 GROUP BY column1 But the result is large and one objective is to store all statistics in RAM. So the DBMS uses a compression method instead. One compression method stores singletons (values with only one occurrence) separately, or not at all. Another compression method takes samples of every nth row. This is easy if there's an index; the DBMS just reads the top level of the B-tree. Once the histogram is in place, if someone executes this SQL statement: SELECT * FROM Table1 WHERE column1 = 'Turkmenistan' the DBMS can do a quick in-RAM binary search of the histogram for column1 during the optimization phase and find out how selective the search condition is. Usually this is more than enough information to form an access plan. Typically there will be histograms for all the important columns (which are often the same as the indexed columns). For other columns, the optimizer will depend on the easier-to-get statistics. Table 17-2. Table1's Data
Table 17-3. Histogram for Table1.column1
Once the optimizer has looked up the statistics, it can plug them into formulas. There is one formula for each ANDed or ORed search condition in a WHERE clause. First the optimizer uses a formula to calculate the cost of each step. Then the optimizer uses a formula to calculate the output from each step, so that if there are two steps, the optimizer will know what the input size and order are for the second step. As an example, consider this query: SELECT * FROM Table1 WHERE column1 = 'Waziristan' AND column2 = 55 Assume that Table1 contains 100 rows and the value 55 appears 60 times in the index for column2 . With the statistics available to it, the optimizer can determine that column1 is indexed and can also determine (from the column1 histogram) that approximately one row will be returned. It can also determine that column2 is indexed and that column2 's histogram says that the value 55 occurs about 60 times. Thus the optimizer has a choice of three different access plans:
Clearly, Plan #2 has the smallest cost, so the optimizer throws the other plans away. They will not be available at execution time. This is a trivial example. It should be emphasized that the DBMS's biggest advantage from a cost-based optimizer becomes evident only when an SQL statement uses a join. Because there are possible plans for each join step, and because the number of possible plans can rise exponentially (for example, a four-way join has 4! [four factorial] plans to choose from), a WHERE clause with a multi-way join and a few semi-joins (subqueries) is too hard for a human. Your job is to optimize everything else, ensure the DBMS has what it needs, and let the optimizer do its job. Analyzing StatisticsAnalyzing the DBMS statistics can tell you when reorganization is necessary. Here are some things to look for.
Ideally, you should rebind application programs after running statistics, because the optimizer may choose a different access plan for your queries, given the new statistics.
|