Designing Indexes and SQL for Performance | DB2 for z/OS Version 8 DBA Certification Guide

Indexes provide various access paths. You need to ensure that indexes are in place to provide the best access path possible for all critical queries.

Indexes for Efficient Access

DB2 uses the following index access paths:

Matching-index scan (MATCHCOLS > 0)
Index screening
Nonmatching-index scan (ACCESSTYPE = I and MATCHCOLS = 0)
IN-list index scan (ACCESSTYPE = N)
Multiple-index access (ACCESSTYPE is M, MX, MI, or MU)
One-fetch access (ACCESSTYPE = I1)
Index-only access (INDEXONLY = Y)
Equal unique index (MATCHCOLS = number of index columns)

Matching-index scan (MATCHCOLS > 0)

In a matching-index scan, predicates are specified on either the leading index-key columns or all of them. These predicates provide filtering; only specific index pages and data pages need to be accessed. If the degree of filtering, or cardinality, is high, the matching-index scan is efficient.

Index-matching predicates are applied in the sequence of the index columns. Therefore, coding sequence is not important for index-matching predicates. DB2 evaluates WHERE clause predicates, based on the following sequence:

Indexed predicates: (a) matching predicates and (b) index-screening predicates
Stage 1 predicates
Stage 2 predicates

Excluding index-matching predicates, within each stagescreening, stage 1, or stage 2predicates are generally applied in the following sequence:

All equal predicates (and IS NULL)
All range predicates (and IS NOT NULL)
All other predicates

The final rule for predicate evaluation dictates that within each of the preceding guidelines, predicates are evaluated in the sequence they are coded. This gives some control: therefore, the programmer should code the most restrictive predicates first.

For more information on stage 1 and stage 2 predicates, refer to Chapter 4.

Index Screening

In index screening, predicates are specified on index-key columns but are not part of the matching columns. Those predicates improve subsequent data page access by reducing the number of rows that qualify while searching the index. For example, with an index on T (C1, C2, C3, C4) in the following SQL statement, C3 > 0 and C4 = 2 are index-screening predicates, whereas C1 = 1 is a matching-index predicate:

 SELECT *FROM T WHERE C1 =1 AND C3 >0 AND C4 =2 AND C5 =8;

Nonmatching-Index Scan (ACCESSTYPE = I and MATCHCOLS = 0)

In a nonmatching-index scan, no matching columns are in the index. Hence, all the index keys must be examined. Because a nonmatching-index scan usually does not provide strong filtering, only a few cases provide an efficient access path if subsequent data pages must also be accessed. If the access path is index-only, a nonmatching-index scan may prove beneficial, especially if the index is smaller than the table space.

NOTE

An index-only nonmatching-index scan may not always be more efficient than a table space scan. A table space may be smaller than an index if the number of index keys is large or the table space is compressed. Also, assuming that both are of similar size, an index scan may be less efficient because the scan must follow the leaf page pointer chain, which may not be sequential, owing to index page splits.

IN-list index scan (ACCESSTYPE = N)

An IN-list index scan is a special case of the matching-index scan; a single indexable IN predicate is used as a matching-equal predicate. You can regard the IN-list index scan as a series of matching-index scans with the values in the IN predicate being used for each matching-index scan. The following example has an index on (C1, C2, C3, C4) and might use an IN-list index scan:

 SELECT *FROM T WHERE C1=1 AND C2 IN (1,2,3) AND C3>0 AND C4<100;

This example could result in an ACCESSTYPE = N and MATCHCOLS = 3 (C1, C2, C3). C4 would be an index-screening predicate.

NOTE

At most, one IN predicate can be a matching predicate; the exception, however, is a noncorrelated IN subquery. IN-list predicates cannot be matching predicates for MX access or list prefetch.

Multiple-index access (ACCESSTYPE is M, MX, MI, or MU)

Multiple-index access uses more than one index to access a table. It is a good access path when no single index provides efficient access and/or a combination of index accesses does provide efficient access.

RID lists are constructed for each of the indexes involved. The unions (OR conditions) or intersections (AND conditions) of the RID lists produce a final list of qualified RIDs; this list is used to retrieve the result rows, using list prefetch. You can consider multiple-index access as an extension to list prefetch, with more complex RID-retrieval operations in its first phase. The complex operators are union and intersection. DB2 may choose multiple-index access for the following query:

 SELECT * FROM EMP WHERE(AGE=34) OR (JOB='MANAGER');

For this query:

EMP is a table with columns EMPNO, EMPNAME, DEPT, JOB, AGE, and SAL.
EMPX1 is an index on EMP with key column AGE.
EMPX2 is an index on EMP with key column JOB.

One-fetch access (ACCESSTYPE = I1)

One-fetch index access requires retrieving only one row. It is the best possible access path and is chosen whenever it is available. It applies to a statement with a MIN or MAX column function: The order of the index allows a single row to give the result of the function. Either an ascending or a descending index can be used to satisfy a MIN or MAX function using one-fetch index access.

Index-only access (INDEXONLY = Y)

With index-only access, the access path does not require any data pages, because the access information is available in the index. Conversely, when an SQL statement requests a column that is not in the index, updates any column in the table, or deletes a row, DB2 has to access the associated data pages. Because the index is generally smaller than the table itself, an index-only access path usually processes the data efficiently.

NOTE

The number of levels of the index can determine whether index-only access will be chosen instead of index and data access via a different index. Assume that IX1 has four levels and provides an index-only access path and that IX2 has three levels and is not index-only. DB2 may choose IX2 and access the data pages, as both access paths result in four I/Os.

Equal Unique Index (MATCHCOLS = Number of Index Columns)

An index that is fully matched and unique and in which all matching predicates are equal predicates is called an equal unique index case. This case guarantees that only one row is retrieved. If no one-fetch index access is available, this is considered the most efficient access over all other indexes that are not equal unique. (The uniqueness of an index is determined by whether it was defined as unique.) Sometimes, DB2 can determine that an index that is not fully matching is an equal unique index case. This is based on the existence of another unique index with a subset of the key columns.

Indexes to Help Avoid Sorts

As well as providing selective access to data, indexes can order data, sometimes eliminating the need for sorting. Some sorts can be avoided if index keys are in the order needed by ORDER BY, GROUP BY, a join operation, or DISTINCT in a column function. A DISTINCT sort can also be avoided if a unique index exists on the selected columns and/or WHERE clause columns. In other cases, as when list prefetch is used, the index does not provide useful ordering, and the selected data might have to be sorted.

When it is absolutely necessary to prevent a sort, consider creating an index on the column or columns necessary to provide that ordering. Consider also using the clause OPTIMIZE FOR 1 ROW to discourage DB2 from choosing a sort for the access path. Consider the following query:

 SELECT C1,C2,C3 FROM T WHERE C1 >1 ORDER BY C1 OPTIMIZE FOR 1 ROW;

An ascending index on C1 or an index on (C1, C2, C3) could eliminate a sort.

NOTE

The OPTIMIZE FOR 1 ROW has implications other than avoiding a sort and therefore should be used only when necessary. Consider specifying a value that represents the number of rows that are required to be processed by the application.

Not all sorts are inefficient. For example, if the index that provides ordering is not an efficient one and many rows qualify, using another access path to retrieve and then sort the data might be more efficient than the inefficient, ordering index.

As of version 8, DB2 can backward scan an index and avoid a sort on queries that can use this. This backward scan will work if the index is the exact opposite of the queries. DB2 will be able to avoid a sort in both of the following examples:

 Index: COL1 ASC, COL2 DESC SELECT COL1, COL2 FROM T ORDER BY COL1 ASC, COL2 DESC SELECT COL1, COL2 FROM T ORDER BY COL1 DESC, COL2 ASC

Dynamic SQL

Dynamic SQL performance is critical for many applications but a bit more difficult to guarantee and tune. A few items can help you get the best performance for dynamic SQL:

Reorganize the DB2 catalog
Reorganize the application table space and indexes
Use dynamic statement caching
SET CURRENT DEGREE = 'ANY' to attempt parallelism

The dynamic SQL cache is important to the performance of dynamic SQL so that the statements do not have to be reprepared. For more information on monitoring and tuning the dynamic SQL cache, refer to Chapter 18. In order to pick up new statistics for objects used in the dynamic SQL, the dynamic SQL cache needs to be refreshed. This is done by executing RUNSTATS on the objects on which the query is dependent. This can also be done by using RUNSTATS NONE REPORT NO if you wish not to update catalog statistics or report on changes.

Query Parallelism

In order to reduce elapsed time for a query, DB2 can provide a query with parallel resources, such as several I/O paths or processors. By taking advantage of these resources, queries can run in a shorter period of time, allowing more work to be pushed through the system. Parallelism can help improve the performance of I/O and CPU-bound read-only queries and can help queries that are reading large amounts of data, regardless of the filtration.

Some overhead is associated with the use of parallelism in terms of CPU. DB2 scales processor-intensive work across all available processors. Parallelism can average less than 1 percent additional CPU overhead for long-running queries and less than 10 percent for short-running queries.

I/O and CPU Parallelism (PARALLELISM_MODE = I or O)

DB2 can use two methods for achieving query parallelism: I/O or CPU. With I/O parallelism, the goal is to move elapsed time toward CPU time by splitting data access into equal, sequential prefetch streams to bring I/O time down to estimated CPU time. If CPU is estimated at 1 second, and I/O at 3 seconds, the three I/O parallel streams of approximately equal size will be started. Each I/O stream should cost about 1 second. This is implemented with a round-robin type of GET paging. With current releases of DB2, I/O parallelism is infrequently chosen; the preferred method is CPU parallelism.

DB2 splits queries into equal multiple smaller queries and processing those queries in multiple execution units, or parallel tasks. At execution time, DB2 will take into consideration the number of CPUs available; if not enough CPUs are available to support the degree of parallelism initially chosen by the optimizer, the degree will be degraded.

DB2 parallelism is decided at both bind time and runtime. If parallelism is not chosen at bind time, it cannot be chosen at runtime. Even if it is chosen at bind time, parallelism may not be used at runtime, owing to several factors. If not enough space is in the virtual buffer pool to support the requested degree of parallelism, the degree can be reduced from that chosen at bind time, or parallelism can be turned of altogether. If host variables are used in the SQL query, DB2 may be prevented from determining which partitions will qualify in a query; therefore, the degree chosen for parallelism will be decided at runtime. If DB2 determines that an ambiguous cursor can be updated, parallelism will be disabled. If parallelism is disabled, the query does not fail; DB2 simply uses a sequential plan for access to the data.

During BIND or PREPARE, DB2 chooses the access path best suited for the query and then does a postoptimization step to identify the sections of the access path that will benefit most from parallelism. DB2 then identifies the part of the query that can be executed in parallel and determines the degree of parallelism to be used.

Queries Best Suited for Parallelism

Queries with the following characteristics will be able to take advantage of parallelism:

Long-running, read-only queries, both static and dynamic SQL, from both local and remote sites, and when using either private or DRDA protocols
Table space scans and index scans
Joins
Nested loops
Merge scans
Hybrid, without sort on new tables
Sorts
Aggregate functions

Parallelism will not be considered in only a few places:

Queries that materialize views
Queries that perform materialization because of nested table expressions
Queries performing a merge-scan join of more than one column
Queries using direct row access

The following will cause only sysplex parallelism not to be considered:

Queries with list prefetch and multiple index access
Queries accessing LOB data

Parallelism should not be used if a system is already CPU constrained, because parallelism would only add to the problem in most situations. CPU parallelism cannot be used when a cursor is defined WITH HOLD, as this cursor's use is potentially interrupted by a commit, which causes a stoppage in processing.

Short-running queries are usually not going to see a great benefit from parallelism. (Generally, a short-running query is one that is subsecond.) But how often are long-running queries separated from short-running queries? If you are trying to get the benefits from parallelism without placing unnecessary overhead where it does not belong, you will need to consider this type of granularity of processing.

You have two potential options: You could separate the long-running queries into a separate package and bind it DEGREE(ANY) in a different collection and then use the SET CURRENTPACKAGESET statement to switch between it and a program bound with DEGREE(1) for shorter queries that are better to let run sequentially. Or you could set the macro SPRMPTH to disable parallelism for short-running queries. The default for this value is 120; thus, any query with an estimated cost of less than 120 milliseconds will have parallelism disabled. Parallelism is enabled for any query above this threshold.

Sysplex Query Parallelism (PARALLELISM_MODE = X)

Sysplex query parallelism works in much the same multitasking way as CPU parallelism and also enables you to take a complex query and run across multiple members in a data sharing group. Sysplex query parallelism is best used with isolation-level UR (uncommitted read) to avoid excess lock propagation.

A query is issued by a coordinator, who sends the query to the assistant members in the group. The data is then processed and returned to the coordinator either by a work filethe coordinator will read each assistant's work filesor by Cross-System Coupling Facility (XCF) links when a work file is not necessary.