How the Optimizer Works | DB2 Developers Guide (5th Edition)

< Day Day Up >

The optimizer performs complex calculations based on a host of information. To simplify the functionality of the optimizer, you can picture it as performing a four-step process:

Receive and verify the SQL statement.
Analyze the environment and optimize the method of satisfying the SQL statement.
Create machine-readable instructions to execute the optimized SQL.
Execute the instructions or store them for future execution.

The second step of this process is the most intriguing. How does the optimizer decide how to execute the vast array of SQL statements that can be sent its way?

The optimizer has many types of strategies for optimizing SQL. How does it choose which of these strategies to use in the optimized access paths? IBM does not publish the exact details and logic used by the optimizer, but the optimizer is cost-based. This means that the optimizer will always attempt to formulate an access path for each query that reduces overall cost. To accomplish this, the DB2 optimizer evaluates and weighs four factors for each potential access path: the CPU cost, the I/O cost, the DB2 Catalog statistics, and the SQL statement itself.

CPU Cost

The optimizer tries to determine the cost of execution of each access path strategy for the query being optimized. Based on the serial number of the CPU, the optimizer estimates the CPU time required to accomplish the tasks associated with the access path it is analyzing. As it calculates this cost, it determines the costs involved in applying predicates, traversing pages (index and tablespace), and sorting.

I/O Cost

The optimizer estimates the cost of physically retrieving and writing the data. In so doing, the optimizer estimates the cost of I/O by using a series of formulas based on the following data: DB2 Catalog statistics, the size of the buffer pools, and the cost of work files used (sorting, intermediate results, and so on). These formulas result in a filter factor, which determines the relative I/O cost of the query. Filter factors are covered in more detail in the "Filter Factors" section, later in this chapter.

DB2 Catalog Statistics

Without the statistics stored in the DB2 Catalog, the optimizer would have a difficult time optimizing anything. These statistics provide the optimizer with information pertinent to the state of the tables that will be accessed by the SQL statement that is being optimized. A complete listing of the DB2 Catalog statistics and values used by the optimizer is in Table 21.1. Partition-level statistics are used when determining the degree of parallelism for queries using I/O, CP, and Sysplex parallelism.

Table 21.1. DB2 Catalog Columns Analyzed by the Optimizer

Catalog Table	Column	Description
`SYSIBM.SYSTABLES`	`CARDF`	Number of rows for the table
	`NPAGES`	Number of pages used by the table
	`EDPROC`	Name of the `EDITPROC` exit routine, if any
	`PCTROWCOMP`	Percentage of active rows compressed for this table
`SYSIBM.SYSTABSTATS`	`CARDF`	Number of rows for the partition
`SYSIBM.SYSTABSTATS`	`NPAGES`	Number of pages on which rows of the partition appear
`SYSIBM.SYSTABLESPACE`	`NACTIVEF`	Number of allocated, active tablespace pages
`SYSIBM.SYSCOLUMNS`	`LOW2KEY`	Second lowest value for the column (first 2,000 bytes only)
	`HIGH2KEY`	Second highest value for the column (first 2,000 bytes only)
	`COLCARDF`	Estimated number of distinct values for the column
`SYSIBM.SYSINDEXES`	`CLUSTERRATIOF`	Percentage (multiplied by 100) of rows in clustering order
	`CLUSTERING`	Whether `CLUSTER YES` was specified when the index was created
	`FIRSTKEYCARDF`	Number of distinct values for the first column of the index key
	`FULLKEYCARDF`	Number of distinct values for the full index key
	`NLEAF`	Number of active leaf pages
	`NLEVELS`	Number of index b-tree levels
`SYSIBM.SYSINDEXPART`	`LIMITKEY`	The limit key of the partition
`SYSIBM.SYSCOLDIST`	`TYPE`	Type of `RUNSTATS` gathered; frequent value ( `F` ) or cardinality ( `C` )
	`COLVALUE`	Non-uniform distribution column value
	`FREQUENCYF`	Percentage (multiplied by 100) of rows that contain the value indicated in the `COLVALUE` column
	`CARDF`	Number of distinct values for the column
	`COLGROUPCOLNO`	Set of columns for the statistics gathered
	`NUMCOLUMNS`	The number of columns associated with the statistics
`SYSIBM.SYSROUTINES`	`IOS_PER_INVOC`	Estimated number of I/Os per invocation of this routine
	`INSTS_PER_INVOC`	Estimated number of instructions per invocation of this routine
	`INITIAL_IOS`	Estimated number of I/Os for the first and last time the routine is invoked
	`INITIAL_INSTS`	Estimated number of instructions for the first and last time the routine is invoked
	`CARDINALITY`	Predicted cardinality for a table function

The columns that have an F at the end of their names were changed from INTEGER columns to FLOAT columns (as of DB2 V5). This enabled DB2 to store larger values in these columns. The largest value that can be stored in an INTEGER column is 2,147,483,647 ; a floating point column can store values up to 7.2 x 10 ⁷⁵ ”a very large number indeed.

RUNSTATS can also keep track of correlated columns. Correlated columns have values that are related to one another. An example of a set of correlated columns is CITY , STATE , and ZIP_CODE . For example, the combination of CHICAGO for CITY and IL for STATE is much more likely to occur than CHICAGO and AK . Such correlations are useful for the optimizer to analyze when formulating access paths.

The RUNSTATS utility also generates information about the frequency of data values. This information is useful to the optimizer because a different access path may be chosen for a predicate when retrieving a value that occurs 40% of the time than for the same predicate when retrieving a value that occurs only 2% of the time.

The SQL Statement

The formulation of the SQL statement also enters into the access path decisions made by the optimizer. The complexity of the query, the number and type of predicates used (Stage 1 versus Stage 2), the usage of column and scalar functions, and the presence of ordering clauses ( ORDER BY , GROUP BY , and DISTINCT ) enter into the estimated cost that is calculated by the optimizer.

Additionally, certain SQL formulations are recognized as optimizer tweaks ”for example, adding OR 0=1 to a predicate to affect indexing.

< Day Day Up >