How the Optimizer Works

 <  Day Day Up  >  

The optimizer performs complex calculations based on a host of information. To simplify the functionality of the optimizer, you can picture it as performing a four-step process:

  1. Receive and verify the SQL statement.

  2. Analyze the environment and optimize the method of satisfying the SQL statement.

  3. Create machine-readable instructions to execute the optimized SQL.

  4. Execute the instructions or store them for future execution.

The second step of this process is the most intriguing. How does the optimizer decide how to execute the vast array of SQL statements that can be sent its way?

The optimizer has many types of strategies for optimizing SQL. How does it choose which of these strategies to use in the optimized access paths? IBM does not publish the exact details and logic used by the optimizer, but the optimizer is cost-based. This means that the optimizer will always attempt to formulate an access path for each query that reduces overall cost. To accomplish this, the DB2 optimizer evaluates and weighs four factors for each potential access path: the CPU cost, the I/O cost, the DB2 Catalog statistics, and the SQL statement itself.

CPU Cost

The optimizer tries to determine the cost of execution of each access path strategy for the query being optimized. Based on the serial number of the CPU, the optimizer estimates the CPU time required to accomplish the tasks associated with the access path it is analyzing. As it calculates this cost, it determines the costs involved in applying predicates, traversing pages (index and tablespace), and sorting.

I/O Cost

The optimizer estimates the cost of physically retrieving and writing the data. In so doing, the optimizer estimates the cost of I/O by using a series of formulas based on the following data: DB2 Catalog statistics, the size of the buffer pools, and the cost of work files used (sorting, intermediate results, and so on). These formulas result in a filter factor, which determines the relative I/O cost of the query. Filter factors are covered in more detail in the "Filter Factors" section, later in this chapter.

DB2 Catalog Statistics

Without the statistics stored in the DB2 Catalog, the optimizer would have a difficult time optimizing anything. These statistics provide the optimizer with information pertinent to the state of the tables that will be accessed by the SQL statement that is being optimized. A complete listing of the DB2 Catalog statistics and values used by the optimizer is in Table 21.1. Partition-level statistics are used when determining the degree of parallelism for queries using I/O, CP, and Sysplex parallelism.

Table 21.1. DB2 Catalog Columns Analyzed by the Optimizer

Catalog Table

Column

Description

SYSIBM.SYSTABLES

CARDF

Number of rows for the table

NPAGES

Number of pages used by the table

EDPROC

Name of the EDITPROC exit routine, if any

PCTROWCOMP

Percentage of active rows compressed for this table

SYSIBM.SYSTABSTATS

CARDF

Number of rows for the partition

NPAGES

Number of pages on which rows of the partition appear

SYSIBM.SYSTABLESPACE

NACTIVEF

Number of allocated, active tablespace pages

SYSIBM.SYSCOLUMNS

LOW2KEY

Second lowest value for the column (first 2,000 bytes only)

HIGH2KEY

Second highest value for the column (first 2,000 bytes only)

COLCARDF

Estimated number of distinct values for the column

SYSIBM.SYSINDEXES

CLUSTERRATIOF

Percentage (multiplied by 100) of rows in clustering order

CLUSTERING

Whether CLUSTER YES was specified when the index was created

FIRSTKEYCARDF

Number of distinct values for the first column of the index key

FULLKEYCARDF

Number of distinct values for the full index key

NLEAF

Number of active leaf pages

NLEVELS

Number of index b-tree levels

SYSIBM.SYSINDEXPART

LIMITKEY

The limit key of the partition

SYSIBM.SYSCOLDIST

TYPE

Type of RUNSTATS gathered; frequent value ( F ) or cardinality ( C )

COLVALUE

Non-uniform distribution column value

FREQUENCYF

Percentage (multiplied by 100) of rows that contain the value indicated in the COLVALUE column

CARDF

Number of distinct values for the column

COLGROUPCOLNO

Set of columns for the statistics gathered

NUMCOLUMNS

The number of columns associated with the statistics

SYSIBM.SYSROUTINES

IOS_PER_INVOC

Estimated number of I/Os per invocation of this routine

INSTS_PER_INVOC

Estimated number of instructions per invocation of this routine

INITIAL_IOS

Estimated number of I/Os for the first and last time the routine is invoked

INITIAL_INSTS

Estimated number of instructions for the first and last time the routine is invoked

CARDINALITY

Predicted cardinality for a table function


The columns that have an F at the end of their names were changed from INTEGER columns to FLOAT columns (as of DB2 V5). This enabled DB2 to store larger values in these columns. The largest value that can be stored in an INTEGER column is 2,147,483,647 ; a floating point column can store values up to 7.2 x 10 75 ”a very large number indeed.

RUNSTATS can also keep track of correlated columns. Correlated columns have values that are related to one another. An example of a set of correlated columns is CITY , STATE , and ZIP_CODE . For example, the combination of CHICAGO for CITY and IL for STATE is much more likely to occur than CHICAGO and AK . Such correlations are useful for the optimizer to analyze when formulating access paths.

The RUNSTATS utility also generates information about the frequency of data values. This information is useful to the optimizer because a different access path may be chosen for a predicate when retrieving a value that occurs 40% of the time than for the same predicate when retrieving a value that occurs only 2% of the time.

The SQL Statement

The formulation of the SQL statement also enters into the access path decisions made by the optimizer. The complexity of the query, the number and type of predicates used (Stage 1 versus Stage 2), the usage of column and scalar functions, and the presence of ordering clauses ( ORDER BY , GROUP BY , and DISTINCT ) enter into the estimated cost that is calculated by the optimizer.

Additionally, certain SQL formulations are recognized as optimizer tweaks ”for example, adding OR 0=1 to a predicate to affect indexing.

 <  Day Day Up  >  


DB2 Developers Guide
DB2 Developers Guide (5th Edition)
ISBN: 0672326132
EAN: 2147483647
Year: 2004
Pages: 388

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net