The optimizer performs complex calculations based on a host of information. To simplify the functionality of the optimizer, you can picture it as performing a four-step process:
The second step of this process is the most intriguing. How does the optimizer decide how to execute the vast array of SQL statements that can be sent its way? The optimizer has many strategies for optimizing SQL. How does it choose which of these strategies to use in the optimized access paths? IBM does not publish the exact details and logic used by the optimizer, but the optimizer is cost-based. This means that the optimizer always attempts to formulate an access path for each query that reduces overall cost. To accomplish this, the DB2 optimizer evaluates and weighs four factors for each potential access path: the CPU cost, the I/O cost, the DB2 Catalog statistics, and the SQL statement itself.

CPU Cost

The optimizer tries to determine the cost of executing each access path strategy for the query being optimized. Based on the serial number of the CPU, the optimizer estimates the CPU time required to accomplish the tasks associated with the access path it is analyzing. As it calculates this cost, it determines the costs involved in applying predicates, traversing pages (index and tablespace), and sorting.

I/O Cost

The optimizer estimates the cost of physically retrieving and writing the data. In so doing, the optimizer estimates the cost of I/O by using a series of formulas based on the following data: DB2 Catalog statistics, the size of the buffer pools, and the cost of work files used (for sorting, intermediate results, and so on). These formulas result in a filter factor, which determines the relative I/O cost of the query. Filter factors are covered in more detail in the "Filter Factors" section, later in this chapter.

DB2 Catalog Statistics

Without the statistics stored in the DB2 Catalog, the optimizer would have a difficult time optimizing anything. These statistics provide the optimizer with information pertinent to the state of the tables that will be accessed by the SQL statement being optimized.
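DB2's actual cost formulas are unpublished, but a simplified sketch shows how a filter factor feeds the cost estimate. For an equality predicate on a column with no distribution statistics, DB2 assumes uniformly distributed data and uses a default filter factor of 1/COLCARDF. The table and statistics below are hypothetical, chosen only to make the arithmetic concrete:

```sql
-- Hypothetical example: assume EMP has CARDF = 1,000,000 rows and
-- the WORKDEPT column has COLCARDF = 50,000 distinct values.
SELECT EMPNO, LASTNAME
FROM   EMP
WHERE  WORKDEPT = 'D11';
-- Default filter factor:       1 / COLCARDF = 1 / 50,000 = 0.00002
-- Estimated qualifying rows:   CARDF * 0.00002 = 20
```

The smaller the estimated number of qualifying rows, the more attractive an index-based access path becomes relative to a tablespace scan.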
A complete listing of the DB2 Catalog statistics and values used by the optimizer appears in Table 21.1. Partition-level statistics are used when determining the degree of parallelism for queries using I/O, CP, and Sysplex parallelism.

Table 21.1. DB2 Catalog Columns Analyzed by the Optimizer
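Because these statistics live in ordinary catalog tables, you can inspect what RUNSTATS has gathered with simple queries. A sketch follows; the creator and table name are hypothetical, and the columns shown (CARDF, NPAGESF, COLCARDF, HIGH2KEY, LOW2KEY) are among those listed in Table 21.1:

```sql
-- Table-level statistics: row count and page count as populated
-- by RUNSTATS, plus the timestamp of the last RUNSTATS execution.
SELECT NAME, CARDF, NPAGESF, STATSTIME
FROM   SYSIBM.SYSTABLES
WHERE  CREATOR = 'DSN8810'
AND    NAME = 'EMP';

-- Column-level statistics: number of distinct values and the
-- second-highest and second-lowest key values for each column.
SELECT NAME, COLCARDF, HIGH2KEY, LOW2KEY
FROM   SYSIBM.SYSCOLUMNS
WHERE  TBCREATOR = 'DSN8810'
AND    TBNAME = 'EMP';
```

Checking STATSTIME is a quick way to spot tables whose statistics are stale, which can cause the optimizer to choose poor access paths.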
The columns that have an F at the end of their names were changed from INTEGER columns to FLOAT columns (as of DB2 V5). This enabled DB2 to store larger values in these columns. The largest value that can be stored in an INTEGER column is 2,147,483,647; a floating point column can store values up to 7.2 x 10^75, a very large number indeed.

RUNSTATS can also keep track of correlated columns. Correlated columns have values that are related to one another. An example of a set of correlated columns is CITY, STATE, and ZIP_CODE. For example, the combination of CHICAGO for CITY and IL for STATE is much more likely to occur than CHICAGO and AK. Such correlations are useful for the optimizer to analyze when formulating access paths.

The RUNSTATS utility also generates information about the frequency of data values. This information is useful to the optimizer because a different access path may be chosen for a predicate when retrieving a value that occurs 40% of the time than for the same predicate when retrieving a value that occurs only 2% of the time.

The SQL Statement

The formulation of the SQL statement also enters into the access path decisions made by the optimizer. The complexity of the query, the number and type of predicates used (Stage 1 versus Stage 2), the usage of column and scalar functions, and the presence of ordering clauses (ORDER BY, GROUP BY, and DISTINCT) all enter into the estimated cost calculated by the optimizer. Additionally, certain SQL formulations are recognized as optimizer tweaks: for example, adding OR 0=1 to a predicate to affect indexing.
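The OR 0=1 tweak deserves a concrete illustration. The sketch below uses a hypothetical EMP table; the added predicate is always false, so it never changes the result set, but combining it with the original predicate via OR can make the compound predicate non-indexable and thereby steer the optimizer away from an index:

```sql
-- Without the tweak, the optimizer may choose an index on WORKDEPT.
SELECT EMPNO, LASTNAME
FROM   EMP
WHERE  WORKDEPT = 'D11';

-- With the tweak, the always-false "OR 0 = 1" returns the same rows
-- but can disqualify index access for the WORKDEPT predicate.
SELECT EMPNO, LASTNAME
FROM   EMP
WHERE  WORKDEPT = 'D11' OR 0 = 1;
```

Tweaks like this trade clarity for control; they should be documented in the code and revisited after each DB2 release, because later optimizers may treat them differently.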