Creating Indexes | DB2 Developers Guide (5th Edition)

< Day Day Up >

A proper indexing strategy can be the most important factor to ensure optimal performance of DB2 applications. However, indexing is most likely improperly implemented at most DB2 sites. This is due to the nature of database development. Database objects typically are created near the beginning of a project ”after the logical data model has been transformed into a physical database design, but before any application code or SQL has been written. So the DBA takes an initial best guess at creating some indexes on each table. Of course, indexes are best designed based on how the data will be accessed. Without the SQL, proper indexes cannot be created. Therefore, as the development process progresses an iterative approach is taken to index creation. New indexes are created to support new queries; old indexes are removed if they are not being used. Such an approach is fraught with potential problems, but such is life.

Indexes should be created to match the predicates of your most important and frequently executed SQL statements. When DB2 accesses data using an index, it can be much more efficient than scanning an entire table for the row(s) that apply. For example, consider the following SQL statement:

 SELECT FIRSTNME, MIDINIT, LASTNAME, WORKDEPT,        PHONENO, SALARY FROM   DSN8810.EMP WHERE  EDLEVEL = 2 AND    BONUS > 100.00;

If there are no indexes defined on the EMP table, DB2 will have to search every row of the table, looking for employees with an education level of 2 and a bonus greater than $100.00. But if we define an index on the table ”on either or both of the two columns in the predicates ”then DB2 can use the index to find the qualifying rows. This should reduce I/O and improve performance.

An index on EDLEVEL , BONUS would be the most beneficial for this query. DB2 can match on the equality predicate and scan on the range ( > ) predicate. So, DB2 can use the index to match the value of EDLEVEL and scan for BONUS values.

An index is created using the CREATE INDEX statement, which is similar in several ways to the CREATE TABLESPACE statement. Both require the user to specify storage ( USING , PRIQTY , and SECQTY ), free space ( PCTFREE and FREEPAGE ), a buffer pool ( BUFFERPOOL ), and how to close the underlying data sets ( CLOSE ). However, there are also many differences.

One big difference is that separate CREATE statements are not used to create an index and an index space. An index space is the underlying storage structure for index data and it is automatically created by DB2 whenever an index is created. There can only be one index in an index space.

Of course, there are many other differences because indexes are different from table spaces, and serve different data processing needs.

In DB2, uniqueness is enforced using an index. You can create a unique index that forces the columns specified for the index to be unique within the table. If you try to INSERT or UPDATE these columns with non-unique values, an error code is returned and the request fails. Creating a unique index is the only way to ensure uniqueness for a column (or columns) in DB2.

You can use an index to guide DB2 to control how table space data is physically stored on disk. This is called clustering . A DB2 index is a clustering index if the CLUSTER keyword is specified when the index is created. Clustering causes inserted rows to be stored contiguously in sequence whenever possible. Additionally, when the table space is reorganized the data will be sequenced according to the clustering index. Since there can only be one physical sequence for data on disk, there can only be one clustering index per table. If you do not specify a clustering index DB2 will choose to cluster the data using the oldest existing index. It is wise to explicitly specify a clustering index instead of letting DB2 decide because you will almost always choose better than the (basically random) choice DB2 makes.

Indexes also can be used to control partitioning. Prior to DB2 V8, a partitioning index was the only way to partition data. As of V8, though, partitioning can be specified and controlled in the table DDL. Notwithstanding this separation of partitioning from indexing, an index can be partitioned itself into separate data sets.

Indexed columns can be specified as ascending or descending. This is accomplished by specifying either ASC or DESC after each column in the index. Specifying ASC causes index entries to be in ascending order by the column (this is the default). DESC puts the index entries in descending order by the column.

Finally, through the use of the DEFINE and DEFER parameters DB2 can delay the creation of the index and its data sets until a later time.

Index Considerations

Before creating any indexes, you should analyze your data and consider the following factors:

^[1]

^[1] Actually, if no clustering index is explicitly defined, DB2 uses the index with the lowest OBID to cluster the data. This usually results in the oldest index being used for clustering, but not necessarily because OBIDs can be reused.

Percentage of table access versus table update
Data access patterns, both ad hoc and planned
Amount of data accessed by each query against the table
Impact on recovery
Performance requirements of accessing the table
Performance requirements of modifying the table
Frequency of INSERT , UPDATE , and DELETE operations
Storage requirements
Impact of reorganization
Impact on the LOAD utility

Remember that indexes are created to enhance performance. Although an index may speed up the performance of a query, each new index will degrade data modification. Keep the following in mind as you create indexes:

Consider indexing on columns used in WHERE , GROUP BY , ORDER BY , and UNION ALL clauses.
Limit the indexing of frequently updated columns.
If indexing a table, explicitly create a clustering index. Failure to do so will result in DB2 clustering data by the first index created. If indexes are subsequently dropped and re-created, this can change the clustering sequence if the indexes are created in a different order.
Consider clustering on columns in GROUP BY and ORDER BY specifications to improve sequential access.
If no sorting or grouping is required, analyze your WHERE clauses and cluster on the columns most frequently referenced, or better yet, most frequently run.
Choose the first column of multi-column indexes wisely, based on the following hierarchy. First, choose columns that will be specified most frequently in SQL WHERE clauses (unless cardinality is very low). Second, choose columns that will be referenced most often in ORDER BY and GROUP BY clauses (once again, unless cardinality is very low). Third, choose columns with the highest cardinality.

NOTE

Low cardinality of the first column is not an issue if a filtering value will always be provided in each query using equality, for example, a predicate like WHERE COL = :HOSTVAR or WHERE COL IN ( list ) .
The biggest payback from an index comes from DB2's capability to locate and retrieve referenced data quickly. DB2's capability to do this is reduced when cardinality is low because multiple RIDs satisfy a given reference. Balance the cardinality of a column against the amount of time it is accessed, giving preference to data access over cardinality.
There are no hard and fast rules for index creation. Experiment with different index combinations and gauge the efficiency of the results.
Keep the number of columns in an index to a minimum. If only three columns are needed, index only those three columns. As more columns are added to the index, data modification degrades.
Sometimes, however, it can be advantageous to include additional columns in an index to increase the chances of index-only access. (Index-only access is discussed further in Chapter 21, "The Optimizer.") For example, suppose that there is an index on the DEPTNO column of the DSN8810.DEPT table. The following query may use this index:
```
 
```
```
 SELECT  DEPTNAME FROM    DSN8810.DEPT WHERE   DEPTNO > 'D00'; 
```
DB2 could use the index to access only those columns with a DEPTNO greater than D00 , and then access the data to return the DEPT .

NOTE

A good rule of thumb for index creation is to keep creating indexes to enhance the performance of your queries until the performance of data modification becomes unacceptable. Then, delete the last index you created. This general approach is best described as creating indexes until it hurts.

Indexing Variable Columns

Prior to Version 8, when indexing on a variable column, DB2 automatically pads the variable column out to its maximum size . So, for example, creating an index on a column defined as VARCHAR(50) will cause the index key to be padded out to the full 50 bytes. Padding poses several problems. You cannot get index-only access with a padded index because DB2 will always have to access the table space data to retrieve the actual size of each column. Remember, the size of a variable-length column is stored in a two-byte prefix and this information is not in a padded index. Also, padding very large variable columns can create a very large index with a lot of wasted space.

DB2 V8 offers the capability to direct DB2 as to whether variable columns in an index should be padded. Appropriately enough, a new option, PADDED (or NOT PADDED ) can be specified when creating indexes. The specification is made at the index level, so every variable column in the index will be either padded or not padded.

When PADDED is specified, DB2 will create the index just as it did prior to V8 ”by padding all variable columns to their maximum size. When NOT PADDED is specified, DB2 will treat the columns as variable and you will be able to obtain index-only access. The length information will be stored in the index key.

NOTE

Keep in mind that DB2 cannot perform index-only access using a padded index ”even if every required column exists in the index. This is so because the actual length of the VARCHAR column(s) is not stored in a padded index. So, DB2 will have to access the table to retrieve the length from the two-byte column prefix.

A new DSNZPARM, named PADIX , is provided to control whether the default is PADDED or NOT PADDED when an index is created and neither is specified. By default, DB2 will create PADDED indexes if you migrate from V7 to V8 but will create NOT PADDED indexes if you install V8 from scratch.

CAUTION

Remember the precaution from Chapter 5, "Data Definition Guidelines," to avoid defaults. Set up the PADIX parameter to create the type of indexes you prefer by default, but when creating indexes be sure to explicitly specify either PADDED or NOT PADDED so that you are assured of creating the type of index you want to create every time. Relying on defaults is lazy and potentially fraught with problems.

Indexing and Partitioning

The indexing requirements for partitioning change significantly in DB2 Version 8. For all releases of DB2 up through and including Version 7, a partitioning index is required to specify the limit keys for each partition; this means that a partitioning index was required and was used to determine which data goes into which partition.

This all changes with DB2 Version 8. To understand partitioning in DB2 V8, first we need to define some terminology: namely, partitioned versus partitioning.

Partitioned and non-partitioned ” A partitioned index means that the index itself is physically partitioned into separate data sets; a non-partitioned index, though, might still be a partitioning index.

Partitioning and secondary index ” A partitioning index means that the index keys correlate to the columns used to partition the data. The index might or might not also be partitioned.

Control of partitioning changes from index-controlled to table-controlled as of DB2 V8. Actually, DB2 V8 supports both types of partitioning, but table-controlled partitioning enables new features not supported under index-controlled partitioning. For example, the ability to easily add or rotate partitions is only supported with table-controlled partitioning.

CAUTION

DB2 will automatically switch from index-based to table-based partitioning if any of the following operations are performed:

Dropping the partitioning index
Altering the partitioning index to be not clustered
Adding a partition using ALTER TABLE ADD PART
Rotating partitions using ALTER TABLE ALTER PART ROTATE
Altering a partition using ALTER TABLE ALTER PART n
Creating a data-partitioned secondary index (DPSI)
Creating an index with the VALUES clause, but without the CLUSTER keyword

When creating partitioned table spaces using DB2 Version 8 or greater, use table-controlled partitioning instead of index-controlled partitioning. Also, you should seriously consider switching your current index-controlled partitioning structures to be table-controlled because of the additional flexibility and functionality it provides.

Clustering and Partitioning

Prior to DB2 V8, the partitioning index for a partitioned table space had to be a clustering index. This means that the data in the table space had to be clustered by the partitioning columns. As of DB2 V8, though, data in a partitioned table space no longer needs to be clustered by the partitioning key. That is, clustering and partitioning are completely independent from each other as of V8.

Data Partitioned Secondary Indexes

One of the biggest problems DBAs face when they are managing large partitioned DB2 table spaces is contending with non-partitioned indexes. DB2 Version 8 helps to remedy these problems with a new type of index ”the data partitioned secondary index, or DPSI (pronounced dipsy). However, before we examine the solution, let's first investigate the problem in a little more detail.

Problems With Non-Partitioning Indexes

Prior to V8, a partitioning index was required to define a partitioned table space. The CREATE INDEX statement specifies the range of values that DB2 will store in each specific partition. The partitioning index will have individual PART clauses, each of which specifies the highest value that can be stored in the partition. To illustrate , consider the following SQL to create a partitioning index:

 CREATE INDEX XEMP2   ON DSN8710.EMP (EMPNO ASC)     USING STOGROUP DSN8G710     PRIQTY 36 ERASE NO CLUSTER       (PART 1 VALUES('H99'),        PART 2 VALUES('P99'),        PART 3 VALUES('Z99'),        PART 4 VALUES('999'))     BUFFERPOOL BP1     CLOSE YES     COPY YES;

This creates four partitions. Behind the scenes, DB2 will create four separate data sets ”both for the table space data and for the index data. However, all other indexes defined on the table will be regular, non-clustering DB2 indexes ”that is, non-partitioning indexes (NPIs). An NPI resides in a single data set unless the PIECESIZE clause is used to break it apart ”and even then the data will not be broken apart by partition. (The PIECESIZE clause is covered in more detail later in this chapter.)

NOTE

As of DB2 V8, NPIs are also referred to as NPSIs (where the S stands for secondary). In this book, the terms NPI and NPSI are used synonymously.

NPIs can cause contention, particularly with DB2 utilities. You can run a utility against a single table space or index partition, but you do not have that luxury with NPIs because they are not partitioned. You can minimize and manage downtime by running utilities a partition at a time. However, running utilities against NPIs can impact the availability of an entire table space. Because an NPI contains data for an entire table space, not just for a single partition, utility operations on an NPI can cause downtime across an entire table space. Additionally, contention on NPIs can cause performance bottlenecks during parallel update, insert, and delete operations.

Solving Problems with DPSIs?

DB2 V8 introduces the Data Partitioned Secondary Index, or DPSI. DPSIs are significant because they help to resolve the problems involved with NPIs discussed in the preceding section. A DPSI is basically a partitioned NPI.

Consult Figure 6.2 for a graphical depiction of the difference between a DPSI and an NPI. This diagram shows a table space partitioned by month. We need to build an index on the CUSTNO for access requirements, but we have a choice as to whether we create an NPI or a DPSI. You can see the different results in the diagram: The DPSI will be partitioned by the same key ranges as the table, but the NPI will not be partitioned at all.

Figure 6.2. DPSI versus NPI.

So, with a DPSI the index will be partitioned based on the data rows. The number of parts in the index will be equal to the number of parts in the table space ”even though the DPSI is created based on columns that are different from those used to define the partitioning scheme for the table space. Therefore, partition 1 of the DPSI will be for the same rows as partition 1 of the table space, and so on. These changes provide many benefits, including

The ability to cluster by a secondary index
The ability to drop and rotate partitions easily
Potentially less overhead in data sharing

NPIs historically have caused DB2 performance and availability problems, especially with utilities. DPSIs solve many of these problems. With DPSIs there is an independent index tree structure for every partition. This means that utilities do not have to share pages or index structures. In addition, logical drains on indexes are now physically at the partition level. This helps utility processing in several useful ways. For example, you can run a LOAD by partition with no contention because the DPSI is partitioned the same way as the data and the partitioning index. Additionally, when reorganizing with DPSIs, the BUILD2 phase is not needed. Even your recovery procedures might be aided because you can copy and recover a single partition of a DPSI.

However, DPSIs are not magical objects that solve all problems. Indeed, changing an NPI to a DPSI will likely cause some queries to perform worse than before. Some queries will need to examine multiple partitions of the DPSI as opposed to the single NPI it previously used. On the other hand, if the query has predicates that reference columns in a single partition only, then performance might improve because only one DPSI partition needs to be probed.

Keep in mind that each DPSI partition has its own index structure. So, a query could potentially have to probe each of these individual index structures to use the DPSI. This type of operation, obviously, will not perform as well as a single probe that would be required against the NPI. So, of course, not every index on a partitioned table should be a DPSI. An additional drawback is that a DPSI cannot be defined as a unique index.

The bottom line on whether to create DPSIs or NPIs is that you will need to analyze your data access and utility processing requirements. DPSIs are easier to manage and can be processed more efficiently by utilities, but can require special query formulation to be efficient. NPIs are typically most efficient for general-purpose queries but come with a lot of administration baggage.

Before using DPSIs, you will have to examine your queries to determine predicate usage and the potential performance impact.

Indexes and Column Cardinality

Column cardinality (that is, the number of distinct values for a column) is an important consideration when building composite indexes. You should analyze column cardinality for columns in a multi-column index.

DB2 records and stores column cardinality in the DB2 Catalog and then uses this information when optimizing SQL to determine access paths. The following cardinality columns are used by the DB2 optimizer:

COLCARDF ” Contains the number of distinct values for this column. The optimizer uses this column to calculate the filter factor for a composite index when equals predicates are not specified for each column. Found in SYSIBM.SYSCOLUMNS for non-partitioned indexes or SYSIBM.SYSCOLSTATS for partitioned indexes.

FIRSTKEYCARDF ” Contains the number of distinct values of the first column in an index. This information is captured only for the first column in a multi-column index. It will be used by the DB2 optimizer when calculating the filter factor for a predicate on a single-column index. Found in SYSIBM.SYSINDEXES for non-partitioned indexes or SYSIBM.SYSINDEXSTATS for partitioned indexes.

FULLKEYCARDF ( SYSIBM.SYSINDEXES ) ” Contains the number of distinct values for the combined, entire key (all columns) of an index. The optimizer uses this column to calculate the filter factor for a composite index when an equality predicate is specified for each column of a multi-column index. Found in SYSIBM.SYSINDEXES for non-partitioned indexes or SYSIBM.SYSINDEXSTATS for partitioned indexes.

Modifying Indexes

As data is added to the table, it is also added to every index defined on the table. For INSERT operations, new keys are placed in the proper sequence in the index. Existing keys are moved to the right to make room for the new entry. If there is not enough room on the page for the new entry, DB2 will try to find space on neighboring pages or on the next free page. When a neighboring page is used, DB2 attempts to redistribute entries in the pages to accommodate the INSERT operation.

As data is deleted from the table, it must also be removed from every index defined on the table. The more indexes defined to a table, the more time it will take for DB2 to perform DELETE operations.

For UPDATE operations, indexes are impacted only if the columns being modified participate in an index. The more indexes containing the columns being modified, the longer the UPDATE will take.

Forming Index Levels

As data is added to the index it will grow and change. For very small indexes, the root page can also act as a leaf page. As data is added, the root page will fill up. When the index becomes too large to accommodate both root and leaf page data, DB2 creates two new leaf pages. Each of these leaf pages will contain half the entries that were in the original root page. The root page will contain pointers to the leaf pages. You have just witnessed the birth of a new index.

Over time, as more data is added, more index entries are added. Eventually, the root page grows too large, causing DB2 to create two more new pages. These will be nonleaf pages, each containing half of the entries that were in the root page. The root page now contains pointers to nonleaf pages ”and thus, another new level is born.

The greater the number of levels in an index, the less efficient it becomes. This is so because DB2 needs to perform an additional I/O operation for an index lookup for each new level in the index.

< Day Day Up >