If you want to know how DB2 will execute a query, you must analyze its access path, which is the method for retrieving data from a specific table or set of tables. Oftentimes, unexpected decreases in performance, such as excessive GETPAGEs, can be attributed to a change in the access path. The Explain facility provides information about how DB2 accesses the data to resolve the SQL statements.
You should have a high-level understanding of how SQL statements are processed by the DB2 database engine: DB2 analyses each SQL statement and then determines how to process it during a static bind or when executed dynamically. The method used to retrieve data from tables is called the access plan.
The DB2 component that determines the access plan to be used is known as the optimizer. During the static preparation of an SQL statement, the SQL compiler is called on to generate an access plan. The access plan contains the data-access strategy, including index usage, sort methods, locking semantics, and join methods.
The executable form of the SQL statement is stored in the system tables when a BIND command is executed. When DB2 processes a program that contains embedded SQL statements, those statements are extracted and placed in a member of a partitioned data set (PDS). This member, called a DBRM (database request module), is the primary input to the BIND command, which can then be bound into a package. For more on packages, refer to Chapter 11.
Sometimes, the complete statement is not known during application development. In this case, the compiler is invoked during program execution to generate a query access plan that can be used by the database manager to access the data. Such an SQL statement is called a dynamic SQL statement. The access plans for a dynamic SQL statement are not stored in the system catalogs but can be cached in memory in the dynamic SQL cache and will not be reprepared if the access plans for the dynamic SQL statements already exist in the dynamic SQL cache.
Explain is a monitoring tool that produces information about the following:
Gathering Explain Data
Populating the PLAN_TABLE can be done in three ways.
Before you can use EXPLAIN, you must create the PLAN_TABLE to hold the results of EXPLAIN. A copy of the statements needed to create the table is in the DB2 sample library under the member name DSNTESC. (Unless you need the information the statements provide, it is not necessary to create a function table or a statement table to use EXPLAIN.)
DB2 does not automatically delete rows from the PLAN_TABLE. To clear the table of obsolete rows, use DELETE, just as you would for deleting rows from any table.
Explain will populate this table; from there, you must interpret the data and act accordingly. Table 17-1 shows the columns in PLAN_TABLE.
This section explains how to examine some of the most useful data in the PLAN_TABLE and determine what the DB2 optimizer is using to access the data.
The following describes the various types of index access and how they are represented in the PLAN_TABLE.
Index Access (ACCESSTYPE is I, I1, N, or M)
If the column ACCESSTYPE in the plan table has a value I, I1, N, or M, DB2 uses an index to access the table named in column TNAME. The columns ACCESSCREATOR and ACCESSNAME identify the index.
Multiple-Index Access (ACCESSTYPE = M)
This value indicates that DB2 uses a set of indexes to access a single table. A set of rows in the plan table contain information about the multiple-index access. The rows are numbered in column MIXOPSEQ in the order of execution of steps in the multiple-index access. (If you retrieve the rows in order by MIXOPSEQ, the result is similar to postfix arithmetic notation.) Additional ACCESSTYPE values for the set of rows describing the multiple-index access further define the type of access: MI for the intersection of multiple indexes, MU for the union of multiple indexes, and MX for an index scan of a named index (in the ACCESSNAME column).
Number of Matching Index Columns (MATCHCOLS = n)
If MATCHCOLS is 0, the access method is called a nonmatching index scan. All the index keys and their row identifiers are read. If MATCHCOLS is greater than 0, the access method is called a matching-index scan: The query uses predicates that match the index columns.
In general, the matching predicates on the leading index columns are equal or IN predicates. The predicate that matches the final index column can be an equal, IN, or range predicate (<, <=, >, >=, LIKE, or BETWEEN). The following example illustrates matching predicates:
SELECT * FROM EMP WHERE JOBCODE = '5' AND LOCATION ='CA' AND SALARY > 60000 AND AGE > 21; INDEX XEMP5 on (JOBCODE,LOCATION,SALARY,AGE)
The index XEMP5 is the chosen access path for this query, with MATCHCOLS = 3. Two equal predicates are on the first two columns, and a range predicate is on the third column. Although the index has four columns, only three of them can be considered matching columns.
At most, one IN predicate can be a matching predicate; the exception, however, is a noncorrelated IN subquery. IN-list predicates cannot be matching predicates for MX access or list prefetch.
Increasing the number of matching columns can help query performance; however, if the column that is added is highly correlated to the existing columns, it may not provide much of a benefit in terms of filtering.
Only Boolean predicates can use matching-index access on a single index.
Index-Only Access (INDEXONLY = Y)
In this case, the method is called index-only access. For a SELECT operation, all the columns needed for the query can be found in the index, and DB2 does not access the table. For an UPDATE or DELETE operation, only the index is required to read the selected row.
Index-only access is not possible when returning varying-length data in the result set or when a VARCHAR column has a LIKE predicate, unless the VARCHAR FROM INDEX field of installation panel DSNTIP4 is set to YES, and plan or packages have been rebound to pick up the change. Also, as of version 8, if the index (with a VARCHAR) is created or altered with the NOT PADDED keyword, it can also be used for index-only access.
If access is by more than one index, INDEXONLY is Y for a step with access type MX. The data pages are not accessed until all the steps for intersection (MI) or union (MU) take place.
When an SQL application uses index-only access for a ROWID column, the application claims the table space or table space partition. As a result, contention may occur between the SQL application and a utility that drains the table space or partition. Index-only access to a table for a ROWID column is not possible if the associated table space or partition is in an incompatible restrictive state. For example, an SQL application can make a read claim on the table space only if the restrictive state allows readers.
The following information describes the various types of table accesses and how they are represented in the PLAN_TABLE.
Table Space Scans (ACCESSTYPE=R, PREFETCH = S)
Table space scan (R = relational scan) is most often used for one of the following reasons.
Assume that table DB2USER1.CANDIDATE has no index on the column identifier (CID). The following is an example that uses a table space scan:
SELECT * FROM DB2USER1.CANDIDATE WHERE CID = :CIDHV;
In this case, at least every row in the CANDIDATE table must be examined to determine whether the value of CID matches the given value.
Prefetching (PREFETCH = L, S, D, or blank)
Prefetching is a method of determining in advance that a set of data pages is about to be used and then reading the entire set into a buffer with a single asynchronous I/O operation. If the value of PREFETCH is S, the method is called sequential prefetch. The data pages that are read in advance are sequential. A table space scan always uses sequential prefetch. An index scan might not use it.
If the value of PREFETCH is L, the method is called list prefetch. One or more indexes are used to select the RIDs for a list of data pages to be read in advance; the pages need not be sequential. Usually, the RIDs are sorted. The exception is the case of a hybrid join (METHOD = 4) when the value of column SORTN_JOIN is N.
If the value of PREFETCH is D, the optimizer expects dynamic prefetch.
If the value of PREFETCH is blank, prefetching is not chosen as an access method. However, depending on the pattern of the page access, data can be prefetched at execution time through a process called sequential detection, also know as dynamic prefetch.
DB2 always attempts to use sequential prefetch for a table space scan. For a segmented table space, if DB2 determines that fewer than four pages will be read at runtime, sequential prefetch is disabled. The OPTIMIZE FOR 1 ROW also potentially disables sequential and list prefetch at bind time.
Limited Partition Scanning (PAGE_RANGE=Y)
DB2 can limit the number of partitions scanned for data access. The query must provide the leading column of the partitioning key. The following example would limit the search for the name of the candidate by providing the high and low numbers of the candidates' CID, therefore limiting the number of partitions to be scanned, assuming that the limit key for the partitions is on CID:
SELECT NAME FROM CANDIDATE WHERE CID BETWEEN :low AND :high
SORT (SORTN_ and SORTC_)
The plan table shows the reasons a sort was invoked. Those reasons could include a sort of data rows or a sort of RIDs in a RID list. SORTN_JOIN indicates that the new table of a join is sorted before the join. (For a hybrid join, this is a sort of the RID list.) When SORTN_JOIN and SORTC_JOIN are both Y, two sorts are performed for the join. The sorts for joins are indicated on the same row as the new table access.
A sort of the composite table for a join (SORTC_JOIN) is beneficial in avoiding death by random I/O. Providing a cluster ratio for an index and keycard statistics for multicolumn cardinalities gives DB2 the information to determine whether a query will suffer from excessive synchronous I/O. In this case, sorting the composite for a nested loop or sorting both, if required, for a sort merge or a hybrid join can provide sequential access to the data.
SORTC_UNIQ indicates a sort to remove duplicates, as might be needed by a SELECT statement with DISTINCT or UNION. SORTC_ORDERBY usually indicates a sort for an ORDER BY clause. But SORTC_UNIQ and SORTC_ORDERBY also indicate when the results of a noncorrelated subquery are sorted, both to remove duplicates and to order the results. A SORTC_GROUPBY would indicate a sort for processing a GROUP BY clause.
If more than one SORTC indicator is set to Y on the same Explain output line, DB2 is performing one sort to accomplish two tasks, such as sorting for uniqueness and ordering. This does not apply to SORTN_JOIN and SORTC_JOIN, as one sort is for the composite and one is for the new table.
To perform list prefetch, DB2 sorts RIDs into ascending page number order. A RID sort is usually not indicated in the PLAN_TABLE, but a RID sort normally is performed whenever list prefetch is used. The only exception to this rule is when a hybrid join is performed and a single, highly clustered index is used on the inner table. In this case, SORTN_JOIN is N, indicating that the RID list for the inner table was not sorted.
SORTN_GROUPBY, SORTN_ORDERBY, and SORTN_UNIQ are not used by DB2.
Nested Loop Join (METHOD = 1)
For a nested loop join, DB2 scans the composite, or outer, table. For each table row that qualifies, by satisfying the predicates on that table, DB2 searches for matching rows of the new, or inner, table, concatenating any it finds with the current row of the composite table. If no rows match the current row:
Stage 1 and stage 2 predicates can eliminate unqualified rows before the physical joining of rows occurs. Nested loop join is often used if
The nested-loop join repetitively scans the inner table. That is, DB2 scans the outer table once and scans the inner table as many times as the number of qualifying rows in the outer table. Hence, the nested loop join is usually the most efficient join method when the values of the join column passed to the inner table are in sequence and the index on the join column of the inner table is clustered, or the number of rows retrieved in the inner table through the index is small. If the tables are not clustered in the same sequence, DB2 can sort the composite to match the sequence of the inner table. Accesses to the inner table can then use sequential or dynamic prefetch.
Merge-Scan Join (METHOD = 2)
The merge-scan join is also known as a merge join, or sort-merge join. This method must have one or more predicates of the form TABLE1.COL1 = TABLE2.COL2, where the two columns have the same data type, length, and null attributes. If the null attributes do not match, the maximum number of merge-join columns is 1. The exception is a full outer join, which permits mismatching null attributes.
Join columns cannot be matching columns for a merge-scan join. Instead, these columns are listed as MERGE_JOIN_COLS in the plan table. Local predicates can be matching predicates and will be applied before the join.
DB2 scans both tables in the order of the join columns. If no efficient indexes on the join columns provide the order, DB2 might sort the outer table, the inner table, or both. The inner table is put into a work file; the outer table is put into a work file only if it must be sorted. When a row of the outer table matches a row of the inner table, DB2 returns the combined rows.
A merge-scan join is often used if
A merge-scan join is always chosen for a full outer join.
Hybrid Join (METHOD = 4)
The method applies only to an inner join and requires an index on the join column of the inner table. The method requires obtaining RIDs in the order needed to use list prefetch. In the successive steps, DB2
A hybrid join is often used if
DB2 may also choose to sort the composite for a hybrid join.
Star Join (METHOD = 0, 1; JOIN TYPE = S)
Star join is the access path used in processing a star schema, a logical database design that is included in many data warehouse and decision-support applications. A star schema is composed of a fact table and a number of dimension tables that are connected to it. A dimension table contains several values that are given an ID, which is used in the fact table instead of all the values. You can think of the fact table, which is much larger than the dimension tables, as being in the center, surrounded by dimension tables; the result resembles a star formation.
To access the data in a star schema, you write SELECT statements that include join operations between the fact table and the dimension tables; no join operations exist between dimension tables. A query must satisfy a number of conditions before it qualifies for the star join access path. The first requirement is detection of the fact table. Given that the access-path objective is efficient access to the fact table, it is important that the fact table be correctly identified.
The first fact-table detection algorithm is known as the unique-index check. Beginning outside-in, the optimizer will evaluate each set of join predicates. For each set of join predicates between two tables, the table with a unique index on the join predicates is considered to be the parent in a parent/child relationship. As DB2 continues outside-in, the table without any further childrenand that therefore has only parentsis considered to be the fact table.
The second fact-table detection algorithm is based on the values of the STARJOIN DSNZPARM, which are
The third fact-table detection algorithm is the topology check. The fact table is considered to be the table with the most number of join predicates in the query.
Once a fact table is identified by using any of the three fact-table detection algorithms, the following conditions must be met for DB2 to use the star join technique:
A successful match on all of the star schema detection rules will immediately qualify the query for star join optimization. A failure on any of those rules for a fact table will result in the next fact-table detection algorithm to be evaluated. A failure of these rules for all the fact-table detection rules will result in the query's being optimized, using standard dynamic programmingexhaustive searchtechniques or algorithms.
When a star join is performed, it is identified by an S in the JOIN_TYPE column of the PLAN_TABLE for dimension tables that are accessed before the fact table.
For a star schema, even though the intersection of all dimensions with the fact table can produce a small result set, the predicates applied to one single dimension table are typically insufficient to reduce the enormous number of fact-table rows.
If a join based on related tablesdimension table to fact tabledoes not provide adequate performance, an alternative is to join unrelated tables. Joining unrelated tables results in a Cartesian product, whereby every row of the first table is joined with every row of the second table.
Performing a Cartesian join of all dimension tables before accessing the fact table may not be efficient. DB2 must decide how many dimension tables should be accessed first to provide the greatest level of filtering of fact-table rows, using available indexes. This can be a delicate balance, as further Cartesian products will produce a massive increase in the size of the intermediate result sets. Alternatively, minimal prejoining of unrelated dimension tables may not provide adequate filtering for the join to the fact table. For an efficient Cartesian process, DB2 uses a logical rather than a physical Cartesian of the dimension tables. Each dimension or snowflakefurther normalized dimensioncovered by the chosen fact-table index is accessed independently before the fact table. Each qualifying dimension and snowflake has all local predicates applied, with the result sorted into join-column order and, finally, materialized into its own separate work file. If many of the dimensions involve snowflakes, this preprocessing and materialization significantly reduce the number of overall tables joined, as the snowflake is resolved into a single dimension.
Rather than requiring the physical work file storage involved in a physical Cartesian, DB2 simulates a Cartesian by repositioning itself within each work file to potentially join all possible combinations to the central fact table. The sequence of this simulated Cartesian join respects the column order of the selected fact-table index.
The sparseness of data within the fact table implies that a significant number of values generated by the Cartesian process are not to be found by a join to the fact table. To minimize the CPU overhead of joining unnecessarily derived rows to the fact table, DB2 introduces an index-key feedback loop to return the next-highest key value whenever a not-found condition is encountered.
A hit on the fact-table index will return the matching fact-table row. A miss will return the next valid fact-table index key so that the data manager can reposition itself within the dimension work files, thus skipping composite rows with no possibility of obtaining a fact-table match.
To further improve the performance of the join to the fact table, the entire join has been pushed down to data manager (stage 1), but this applies only for star join access from the composite (dimensions) to the fact table. This ensures a reduced path length, as rows no longer need to be returned to RDS (stage 2) for the join to occur. The join method used by this process is a nested loop join.
To help the performance of star join queries the following can be considered: promoting the use of the dynamic SQL cache, increasing the size of the work file pool for star joins via the SJMXPOOL DSNZPARM, increasing the number of parallel degrees with CURRENT DEGREE, or considering the use of access-path hints.
Parallelism Usage (PARALLELISM_MODE = I, C, or X)
Parallel processing applies only to read-only queries. The values are
Non-null values in columns ACCESS_DEGREE and JOIN_DEGREE indicate to what degreenumber of concurrent processesDB2 plans to use parallel operations. However, this can change at runtime.
If it exists, this table is populated when EXPLAIN is run. Some of the columns are similar to those in the PLAN_TABLE, and some are new and related only to the statement cost. The columns that are unique to the DSN_STATEMNT_TABLE are
The COST_CATEGORY determination is affected by such things as the number of tables, the number of rows in the tables, column cardinality, cluster ratio, first key cardinality, full key cardinality, the number leaf pages, the number of index levels, host variables, special registers, triggers, UDFs, RI, LOBs, and expressions. Table 17-2 describes the contents of the DSN_STATEMNT_TABLE, and Table 17-3 describes the contents of the DSN_FUNCTION_TABLE.
Access Path Hints
DB2 provides a facility for telling it how to process a query. This is done by giving DB2 hints. The process of giving hints to DB2 is relatively simple, but determining what those hints should be is not. Giving optimization hints to DB2 is useful in the following situations:
The facility for implementing access-path hints requires many tasks to be performed, as well as making sure that the subsystem has been enabled for hints. Some of the items that will be affected by using hints are
Even after a hint has been established, it is not always possible for the optimizer to use it. As part of the normal bind process, the optimizer will have to evaluate the hint and determine whether it is valid. Processes will have to be in place to make sure that rebinds do not change the process. For this reason, hints should not be used unless all else fails in establishing an acceptable access path.
Using Statistics to Model a Production Environment
The optimizer uses statistics in the catalog in calculating the costs of SQL to determine the best access path to pick. In building a test environment, it is often helpful to copy statistics from a production environment or to update the statistics to an approximation of what production might be like. Table 17-4 lists the catalog statistics that can be updated and are used by the optimizer in access-path selection.
Guidelines for Using Explain Output
Analyzing the Explain data can help you to tune your queries and environment in a number of ways. For example, are indexes being used? Creating appropriate indexes can have a significant positive impact on performance. You can use the Explain output to determine whether the indexes you have created to help a specific set of queries are being used. In the Explain output, you should look for index usage in the following areas:
You can also use Explain to evaluate whether a different index can be used instead of an existing index or no index at all. After creating a new index, collect statistics for that index, using the RUNSTATS utility, and rebind the query.
Collecting additional statistics, including statistics on non-key columns, can enhance the optimizer's access-path selection for queries involving joins.
Over time, you may notice, through the Explain data, that instead of an index scan, a table scan is now being used. This can result from a change in the clustering of the table data. If the index that was previously being used now has a low cluster ratio, you want to do the following.
The more indexes that are created, the more importance is placed on accurate and detailed statistics. DB2 uses these statistics to differentiate among the indexes for access-path selection. Increasing the number of indexes increases the potential access paths that DB2 must evaluate. This may lead to a less efficient access path being selected if statistics do not adequately distinguish each index for DB2.
Is the type of access appropriate for the application? You can analyze the Explain output and look for types of data access that, as a rule, are not optimal for the type of application being executed. For example, OLTP applications are prime candidates to use matching-index scans with range-delimiting predicates; these applications tend to return only a few rows in their queries, using an equality predicate against a key column. If your OLTP queries are using a table scan, nonmatching-index scan, or filtering on too few columns, you may want to analyze the Explain data and index statistics to determine why a matching-index scan was not used.
Similarly, the search criteria for a read-only query may be vague, causing a large number of rows to qualify. A user who usually looks at only a few screens of the output data may want to try to ensure that the entire answer set need not be computed before some results are returned. In this case, the goals of the user are different from the basic operating principle of the optimizer, which attempts to minimize resource consumption for the entire query, not just the first few screens of data.
For example, if the Explain output shows that both merge-scan join and sort operators were used in the access plan, the entire answer set will be materialized in a temporary table before any rows are returned to the application. In this case, you can attempt to change the access plan by using the OPTIMIZE FOR clause on the SELECT statement. The value specified for the OPTIMIZE clause should represent the number of rows to be processed by the application. In this way, the optimizer can attempt to choose an access plan that does not produce the entire answer set in a temporary table before returning the first rows to the application.
Visual Explain is a GUI utility that enables the database administrator or application developer to examine the access plan determined by the optimizer. The utility provides
Visual Explain can be used to analyze previously generated Explains or to gather Explain data and Explain dynamic SQL statements. It will also allow you to generate reports and have the analysis output saved as an XML document.