The LOAD Utility


The LOAD utility is used to accomplish bulk inserts to DB2 tables. It can add rows to a table, retaining the current data, or it can replace existing rows with the new data.

Table Loading Philosophies

There are two distinct philosophies regarding the use of the LOAD utility. The first and generally recommended philosophy takes more time to implement but is easier to support. It requires the reservation of sufficient DASD to catalog the LOAD work data sets in case the LOAD job abends.

The work data sets for the LOAD job are allocated for the DDNAMEs SORTOUT, SYSUT1, SYSERR, and SYSMAP with DISP=(MOD,DELETE,CATLG). This enables the data sets to be allocated as new for the initial run of the LOAD job. If the job abends, the data sets are cataloged in case they can be used in a restart. After the step completes successfully, the data sets are deleted. The space for these data sets must be planned and available before the LOAD job runs.

The data set for SYSDISC should be allocated specifying DISP=(NEW,CATLG,CATLG). If there are discards, the LOAD utility returns RC=4; it does not abend. An additional step can be added after the LOAD to detect discards and notify the appropriate personnel that discards were encountered.

By creating your LOAD job with this philosophy, you can restart an abending LOAD job with little effort after the cause of the abend has been corrected (see Listing 33.1). You simply specify one of the RESTART options in the UTPROC parameter for DSNUTILB .

Listing 33.1. LOAD JCL (Restartable)
//DB2JOBU  JOB (UTILITY),'DB2 LOAD',MSGCLASS=X,CLASS=X,
//          NOTIFY=USER
//*
//****************************************************************
//*
//*            DB2 LOAD UTILITY (RESTARTABLE)
//*
//****************************************************************
//*
//UTIL EXEC DSNUPROC,SYSTEM=DSN,UID='LOADDATA',UTPROC=''
//*
//*  UTILITY WORK DATASETS
//*
//DSNUPROC.SORTWK01 DD UNIT=SYSDA,SPACE=(CYL,(2,1))
//DSNUPROC.SORTWK02 DD UNIT=SYSDA,SPACE=(CYL,(2,1))
//DSNUPROC.SORTOUT DD DSN=CAT.SORTOUT,DISP=(MOD,DELETE,CATLG),
//        UNIT=SYSDA,SPACE=(CYL,(2,1))
//DSNUPROC.SYSMAP DD DSN=CAT.SYSMAP,DISP=(MOD,DELETE,CATLG),
//        UNIT=SYSDA,SPACE=(CYL,(2,1)),DCB=BUFNO=20
//DSNUPROC.SYSUT1 DD DSN=CAT.SYSUT1,DISP=(MOD,DELETE,CATLG),
//        UNIT=SYSDA,SPACE=(CYL,(2,1)),DCB=BUFNO=20
//DSNUPROC.SYSDISC DD DSN=CAT.SYSDISC,DISP=(NEW,CATLG,CATLG),
//        UNIT=SYSDA,SPACE=(CYL,(1,1))
//DSNUPROC.SYSERR DD DSN=CAT.SYSERR,DISP=(MOD,DELETE,CATLG),
//        UNIT=SYSDA,SPACE=(CYL,(1,1))
//DSNUPROC.SYSREC00 DD DSN=CAT.LOAD.INPUT.DATASETA,DISP=SHR,DCB=BUFNO=20
//DSNUPROC.UTPRINT DD SYSOUT=X
//*
//*  UTILITY INPUT CONTROL STATEMENTS
//*       The LOAD statement reloads the DSN8810.ACT table
//*
//DSNUPROC.SYSIN    DD  *
   LOAD DATA REPLACE INDDN SYSREC00 LOG NO
   INTO TABLE DSN8810.ACT
       (ACTNO       POSITION(1)  SMALLINT,
        ACTKWD      POSITION(3)  CHAR(6),
        ACTDESC     POSITION(9)  VARCHAR
       )
/*
//

NOTE

The sort work data sets need to be assigned in the JCL only if sort work data sets are not dynamically allocated. Additionally, you should consider explicitly defining sort work data sets when loading very large tables.


The second philosophy is easier to implement but more difficult to support. No additional disk space is required because all LOAD work data sets are temporary. Therefore, all interim work data sets are lost when the job abends. See Listing 33.2 for sample JCL.

Listing 33.2. LOAD JCL (Nonrestartable)
//DB2JOBU  JOB (UTILITY),'DB2 LOAD',MSGCLASS=X,CLASS=X,
//   NOTIFY=USER,REGION=3M
//*
//****************************************************************
//*
//*            DB2 LOAD UTILITY (NON-RESTARTABLE)
//*
//****************************************************************
//*
//UTIL EXEC DSNUPROC,SYSTEM=DSN,UID='LOADDATA',UTPROC=''
//*
//*  UTILITY WORK DATASETS
//*
//DSNUPROC.SORTWK01 DD DSN=&&SORTWK01,
//        UNIT=SYSDA,SPACE=(CYL,(2,1))
//DSNUPROC.SORTWK02 DD DSN=&&SORTWK02,
//        UNIT=SYSDA,SPACE=(CYL,(2,1))
//DSNUPROC.SORTOUT DD DSN=&&SORTOUT,
//        UNIT=SYSDA,SPACE=(CYL,(2,1))
//DSNUPROC.SYSMAP DD DSN=&&SYSMAP,
//        UNIT=SYSDA,SPACE=(CYL,(2,1))
//DSNUPROC.SYSUT1 DD DSN=&&SYSUT1,DCB=BUFNO=10,
//        UNIT=SYSDA,SPACE=(CYL,(2,1))
//DSNUPROC.SYSDISC DD DSN=CAT.SYSDISC,DISP=(MOD,CATLG,CATLG),
//        UNIT=SYSDA,SPACE=(CYL,(1,1))
//DSNUPROC.SYSERR DD DSN=&&SYSERR,
//        UNIT=SYSDA,SPACE=(CYL,(1,1))
//DSNUPROC.SYSREC00 DD DSN=CAT.LOAD.INPUT.DATASETD,DISP=SHR,DCB=BUFNO=10
//DSNUPROC.UTPRINT DD SYSOUT=X
//*
//*  UTILITY INPUT CONTROL STATEMENTS
//*       The LOAD statement adds the data in SYSREC00 to
//*       the DSN8810.DEPT table.
//*
//DSNUPROC.SYSIN    DD  *
   LOAD DATA RESUME(YES) ENFORCE CONSTRAINTS LOG NO
   INDDN SYSREC00 INTO TABLE DSN8810.DEPT
       (DEPTNO       POSITION(  1)   CHAR( 3),
        DEPTNAME     POSITION(  4)   VARCHAR,
        MGRNO        POSITION( 42)   CHAR( 6)  NULLIF( 48)='?',
        ADMRDEPT     POSITION( 49)   CHAR( 3),
        LOCATION     POSITION( 52)   CHAR(16)  NULLIF( 68)='?'
       )
/*
//

To restart this LOAD job, you must determine in which phase the job abended. If the job abends in any phase of a LOAD REPLACE, you can simply terminate the utility and rerun the job. Rerunning from scratch, however, can incur significant overhead by needlessly reprocessing data. If the first philosophy is used, reprocessing is usually avoided.

For a LOAD RESUME(YES), however, if the job abends in any phase other than UTILINIT, you must restore the table space for the table being loaded to a previous point in time. This can be accomplished by running the RECOVER TOCOPY utility, or by running a full RECOVER if the LOG YES option of the LOAD utility was specified. After restoring the table space (and possibly its associated indexes), you must correct the cause of the abend, terminate the utility, and then rerun the job. As you can see, this method is significantly more difficult to restart than the first method.

Try to use the first philosophy rather than the second. This makes recovery from error situations as smooth and painless as possible.

Estimating LOAD Work Data Set Sizes

The LOAD utility requires work data sets to load data into DB2 tables. The following formulas can help you calculate estimated sizes for these work data sets. More complex and precise calculations are in the DB2 Utility Guide and Reference manual, but these formulas should produce comparable results.

 

 SORTOUT = (size of the largest index key or foreign key + 14) x (total number of rows in the table to be loaded) x (total number of indexes defined for the table) x (total number of foreign keys in the table) x 1.2

NOTE

If any number in the SORTOUT calculation is 0, substitute 1.

The multiplier 1.2 is factored into the calculation to provide a "fudge factor." If you are absolutely sure of your numbers, you can make the calculation more precise by eliminating the multiplication by 1.2.


 

 SYSUT1 = (size of the largest index key or foreign key + 14) x (total number of rows to be loaded to the table) x (total number of indexes defined for the table) x (total number of foreign keys in the table) x 1.2

NOTE

If any number in the SYSUT1 calculation is 0, substitute 1. The multiplier 1.2 is factored into the calculation to provide a "fudge factor." If you are absolutely sure of your numbers, the calculation can be made more precise by eliminating the additional multiplication of 1.2.


 

 SORTWKxx = (size of SYSUT1) x 2

 SYSERR = ((number of estimated unique index errors) +
           (number of estimated data conversion errors) +
           (number of estimated referential constraint violations)) x 100

NOTE

Always allocate the SYSERR data set to be at least 1 cylinder.


 

 SYSMAP = (total number of rows to be loaded to the table) x 21 

NOTE

The SYSMAP data set is required if either of the following is true:

  • Discard processing is requested.

  • The table space is segmented or partitioned.


 

 SYSDISC = Allocate the SYSDISC data set to be the same size as the data set containing the rows to be loaded by the LOAD utility

NOTE

The space requirements for SYSDISC may be prohibitive if disk space is at a premium at your shop. Instead of allocating the SYSDISC data set as large as the data being loaded, consider using a small primary quantity and a larger secondary quantity, for example:

 

 SPACE=(CYL,(0,50),RLSE) 


NOTE

Although the SYSDISC data set is optional, specifying it is highly recommended to trap records that cannot be loaded.


After calculating the estimated size in bytes for each work data set, convert the number into cylinders, rounding up to the next whole cylinder. Allocating work data sets in cylinder increments enhances the utility's performance.
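As a rough worked example (the figures are illustrative, and the bytes-per-cylinder value assumes a 3390 device; substitute the capacity of your own DASD), consider loading 100,000 rows into a table with a 10-byte largest key, two indexes, and one foreign key:

 SYSUT1    = (10 + 14) x 100,000 x 2 x 1 x 1.2 = 5,760,000 bytes
 Cylinders = 5,760,000 / 849,960 (approximate bytes per 3390 cylinder)
           = 6.8, rounded up to 7 cylinders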

LOAD Phases

There are ten possible phases of the LOAD utility:

UTILINIT

Sets up and initializes the LOAD utility.

RELOAD

Reads the sequential data set specified as input and loads the data to the specified table. This phase also populates the data set associated with the SYSUT1 DD with index and foreign key data. The compression dictionary is rebuilt in this step for COMPRESS YES table spaces. The copy pending flag is reset at the end of this phase if an inline copy is produced (unless the SORTKEYS parameter is specified).

SORT

Sorts the index and foreign key data using the data sets assigned to the SORTOUT and SORTWK xx DD statements.

BUILD

Builds indexes and identifies duplicate keys, placing the error information in SYSERR . The recovery pending flag is reset for all non-unique indexes. The copy pending flag is reset at the end of this phase if an inline copy is produced unless the SORTKEYS parameter is specified.

SORTBLD

When parallel index build is specified ( SORTKEYS ), the SORT and BUILD phases are performed in the SORTBLD phase instead.

INDEXVAL

Reads the SYSERR data set to correct unique index violations. The recovery pending flag is reset for all unique indexes.

ENFORCE

Checks foreign keys for conformance to referential constraints and stores the error information in SYSERR . Resets check pending flag for table space.

DISCARD

Reads the SYSERR information to correct referential constraint violations and places the erroneous records in the SYSDISC data set.

REPORT

Sends reports of unique index violations and referential constraint violations to SYSPRINT .

UTILTERM

Performs the final utility cleanup.


NOTE

The SORT phase will be skipped if the RELOAD phase analyzes the data and determines that the SORT phase is not needed.


Creating an Inline Copy During the LOAD

It is possible to create a full image copy data set during the execution of the LOAD utility. This is referred to as an inline COPY. The image copy will be a SHRLEVEL REFERENCE copy.

There are two major benefits of taking an inline copy. The first is that a second pass of the data is not required to create a DB2 image copy. The second benefit is that the table space into which the data is being loaded will not be placed into a copy pending state when inline copy is specified, even if the LOG NO option is specified.

To create an inline copy, use the COPYDDN and RECOVERYDDN keywords. You can specify up to two primary and two secondary copies.
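For example, a LOAD statement requesting inline copies might be coded as follows. This is a sketch: the DD names LOCALP, LOCALB, REMOTEP, and REMOTEB are placeholders and must match image copy data sets allocated in the JCL. COPYDDN names the primary and backup local site copies; RECOVERYDDN names the recovery site copies.

 LOAD DATA REPLACE LOG NO INDDN SYSREC00
      COPYDDN(LOCALP,LOCALB)
      RECOVERYDDN(REMOTEP,REMOTEB)
 INTO TABLE DSN8810.ACT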

Gathering Inline Statistics During the LOAD

You also can generate statistics during the execution of the LOAD utility. This is referred to as inline RUNSTATS . Up-to-date statistics will be generated during the LOAD instead of requiring an additional RUNSTATS step. To generate inline RUNSTATS , use the STATISTICS keyword. You can gather table space statistics, index statistics, or both.
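For instance, to gather both table space and index statistics during the load, the STATISTICS keyword can be coded as shown in this sketch; adjust the keywords to collect only the statistics you actually need:

 LOAD DATA REPLACE LOG NO INDDN SYSREC00
      STATISTICS TABLE(ALL) INDEX(ALL)
 INTO TABLE DSN8810.ACT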

Discards and Inline RUNSTATS

If you specify both the DISCARDDN and STATISTICS options, the inline statistics collected during the LOAD may be inaccurate. When a row is found with check constraint errors or conversion errors, the row is not loaded into the table, so DB2 will not collect statistics for it. So far, so good.

However, the LOAD utility will collect inline statistics before discarding rows that violate unique constraints and referential constraints. Therefore, when the number of rows that violate RI and unique constraints is high, the statistics could be quite imprecise. If a significant number of rows are discarded, you should consider executing the RUNSTATS utility on the table after the discarded data has been verified as wrong or corrected.

Loading Delimited Input Data Sets


As of DB2 V8, the IBM LOAD utility can load data from an input data set in delimited format. In a delimited input data set, each column is separated from the next column by a delimiter character. Additionally, all the fields in the input data file must be character strings or external numeric values. Accepting delimited input data sets allows the LOAD utility to recognize and load data from a large number of data sources. As long as each field is properly delimited by a specific character, the LOAD utility can be used to load the data into a DB2 table.


CAUTION

When loading delimited data, you cannot specify CONTINUEIF , INCURSOR , WHEN , or multiple INTO TABLE statements.


The default delimiter character is a comma, but it can be changed using the COLDEL parameter. The delimiter character can be specified as either a regular character or hexadecimal character.

CAUTION

When you specify the delimiter character, be sure to verify that the character is specified in the code page of the source data. Furthermore, if the utility control parameter is coded in a different character type than the input file, specify the COLDEL in hex or the result can be unpredictable. For example, specify the delimiter as a hex constant if the utility control statement is coded in EBCDIC and the input data is ASCII or Unicode.


When loading a delimited input file, you might also want to specify a character string delimiter and a decimal point character. The default character string delimiter is the double quote character ( " ), but it can be set to another character using the CHARDEL parameter. You need to specify a different character string delimiter only if the data to be loaded contains the default delimiter. Though not required, you can put character string delimiters around other character strings. To load a string containing the character string delimiter, you must double the character. For example, code the following to LOAD a string containing: He told me "You look well" and I liked it.

 

 "He told me ""You look well"" and I liked it." 

Finally, you can change the decimal point character, too. This is done using the DECPT parameter. The default is a period, but some countries use a comma. Therefore, you can change the decimal point character to another character of your choice using DECPT .
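Putting these options together, a LOAD statement for a semicolon-delimited file that uses a comma as the decimal point might look like the following sketch; the delimiter choices are examples only and must suit the code page of your input data:

 LOAD DATA REPLACE INDDN SYSREC00
      FORMAT DELIMITED COLDEL ';' CHARDEL '"' DECPT ','
 INTO TABLE DSN8810.DEPT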

LOAD Rerun/Restart Procedures

The LOAD utility can be restarted. The restart or rerun procedure is determined by the abending phase of the LOAD step. There are two ways to determine the phase in which the abend occurred.

The first method is to issue the DISPLAY UTILITY command to determine which utilities are currently active, stopped, or terminating in the DB2 system. The format of the command is

 

 -DISPLAY UTILITY(*) 

The second method to determine the abending phase is to view the SYSPRINT DD statement of the LOAD step. This method is not as desirable as the first, but it is the only method you can use when the DB2 system is down. At the completion of each phase, DB2 prints a line stating that the phase has completed. You can assume that the phase immediately following the last phase reported complete in the SYSPRINT DD statement is the phase that was executing when the abend occurred.

After determining the phase of the LOAD utility at the time of the abend, follow the steps outlined here to restart or rerun the load. In the following procedures, it is assumed that your LOAD utility processing is generally restartable.

If the abend occurred in the UTILINIT phase

  1. Determine the cause of the abend. An abend in this step is usually caused by another utility executing with the same UID or a utility that is incompatible with another utility currently executing.

  2. Resolve the cause of the abend. An abend in this phase is probably due to improper job scheduling. Issue the DISPLAY UTILITY command to determine which utilities are currently in process for the DB2 system. Resolve the scheduling problem by allowing conflicting utilities to complete before proceeding to step 3.

    Another possible cause is insufficient sort space. If the SORTWKxx data sets are dynamically allocated, try to resolve the problem using the following methods:

    • Use the SORTDEVT clause to dynamically create the SORTWKxx data sets someplace else.

    • Use the SORTNUM clause to increase the number of dynamically allocated sort work files.

    • Clean the work packs by deleting or moving extraneous files.

    • Explicitly allocate the appropriate sort work data sets in the JCL.

  3. Restart the job at the LOAD step.

If the abend occurred in the RELOAD phase

  1. Determine the cause of the abend. An abend in this step is usually caused by insufficient space allocated to the SYSUT1 DD statement. Another cause is that the VSAM data set associated with the table space has run out of available DASD space.

  2. Resolve the cause of the abend.

    1. If the problem is an out-of-space abend (B37) on the SYSUT1 DD statement, the data set associated with that DD statement will have been cataloged. Allocate a new data set with additional space, copy the SYSUT1 data set to the new data set, delete the original SYSUT1 data set, and rename the new data set to the same name as the original SYSUT1 data set.

    2. If the problem is an out-of-space abend on the VSAM data set containing the table space being reloaded, contact the DBA or DASD support unit. This situation can be corrected by adding another volume to the STOGROUP being used; using IDCAMS to redefine the VSAM data set, move the VSAM data set, or both; or altering the primary space allocation quantity for the index, the secondary space allocation quantity for the index, or both.

      Restart the job at the LOAD step with a temporary JCL change to alter the UTPROC parameter to RESTART .

CAUTION

Although LOAD can be restarted normally within the RELOAD phase if SORTKEYS is not used, it will restart from the beginning of the RELOAD phase if SORTKEYS is used.


If the abend occurred in the SORT phase

  1. Determine the cause of the abend. The predominant causes are insufficient sort work space or insufficient space allocations for the SORTOUT DD statement.

  2. Resolve the cause of the abend. If the problem is insufficient space on the sort work or SORTOUT DD statements, simply increase the allocations and proceed to step 3.

  3. Restart the job at the LOAD step with a temporary change to alter the UTPROC parameter to RESTART(PHASE) .

If the abend occurred in the BUILD phase

  1. Determine the cause for the abend. An abend in this step is usually caused by insufficient space allocated to the SYSERR DD statement. Another cause is that the VSAM data set associated with the index space has run out of available DASD space.

  2. Resolve the cause of the abend.

    1. If the problem is an out-of-space abend (B37) on the SYSERR DD statement, the data set associated with the DD statement will have been cataloged. Allocate a new data set with additional space, copy the SYSERR data set to the new data set, delete the original SYSERR data set, and rename the new data set to the same name as the original SYSERR data set.

    2. If the problem is an out-of-space abend on the VSAM data set containing the index space being reloaded, contact the DBA or DASD support unit. This situation can be corrected by adding another volume to the STOGROUP being used; using IDCAMS to redefine the VSAM data set, move the VSAM data set, or both; or altering the primary space allocation quantity for the index, the secondary space allocation quantity for the index, or both.



    1. If LOAD was run using the REPLACE option, restart the job at the LOAD step with a temporary change to alter the UTPROC parameter to RESTART(PHASE) .

    2. If LOAD was run using the RESUME(YES) option, the LOAD is not restartable. Terminate the LOAD utility and rebuild the indexes using the REBUILD INDEX utility.

NOTE

When the SORTKEYS parameter is used and the LOAD utility terminates during the RELOAD , SORT , or BUILD phases, both RESTART and RESTART(PHASE) restart from the beginning of the RELOAD phase.


If the abend occurred in the INDEXVAL phase

  1. Determine the cause of the abend. Abends in this phase are rare. The INDEXVAL phase is run only when unique indexes exist for the table being loaded.

  2. Resolve the cause of the abend.

  3. Restart the job at the LOAD step with a temporary JCL change to alter the UTPROC parameter to RESTART(PHASE) .

If the abend occurred in the ENFORCE phase

  1. Determine the cause for the abend. An abend in this step is usually caused by insufficient space allocated to the SYSERR DD statement. The ENFORCE phase is optional and is not always run.

  2. Resolve the cause of the abend. If the problem is an out-of-space abend ( B37 ) on the SYSERR DD statement, the data set associated with that DD statement will have been cataloged. Allocate a new data set with additional space, copy the SYSERR data set to the new data set, delete the original SYSERR data set, and rename the new data set to the same name as the original SYSERR data set.

  3. Restart the job at the LOAD step with a temporary change to alter the UTPROC parameter to RESTART .

If the abend occurred in the DISCARD phase

  1. Determine the cause for the abend. An abend in this step is usually caused by insufficient space allocated to the SYSDISC DD statement. The DISCARD phase is optional and is not always run.

  2. Resolve the cause of the abend. If the problem is an out-of-space abend ( B37 ) on the SYSDISC DD statement, the data set associated with that DD statement will have been cataloged. Allocate a new data set with additional space, copy the SYSDISC data set to the new data set, delete the original SYSDISC data set, and rename the new data set to the same name as the original SYSDISC data set.

  3. Restart the job at the LOAD step with a temporary change to alter the UTPROC parameter to RESTART .

If the abend occurred in the REPORT phase

  1. Determine the cause for the abend. Abends in the REPORT phase are rare. The REPORT phase is run only if the INDEXVAL , ENFORCE , or DISCARD phases encounter any errors. Sometimes the cause for an abend in this phase is insufficient space allocated to the sort work data sets because the report is sorted by error type and input sequence.

  2. Resolve the cause of the abend. If the problem was caused by insufficient space on the sort work or SORTOUT DD statements, simply increase the allocations and proceed to step 3.

  3. Restart the job at the LOAD step with a temporary change to alter the UTPROC parameter to RESTART(PHASE) .

If the abend occurred in the UTILTERM phase

  1. An abend in this phase is unlikely because all the work required for the load has been completed. A problem at this phase means that DB2 cannot terminate the utility.

  2. Terminate the DB2 utility by issuing the TERM UTILITY command. The format of the command is

     

     -TERM UTILITY(  UID  ) 

    where UID is obtained from the -DISPLAY UTILITY (*) command.

  3. If the LOAD utility work data sets associated with this job were cataloged as a result of the abend, uncatalog them and force the job's completion.

LOAD Locking and Concurrency

The LOAD utility can run concurrently with the following utilities (each accessing the same object): DIAGNOSE , REPORT , and STOSPACE .

The LOAD utility will drain all claim classes for the table space or partition being loaded and any associated indexes, index partitions, and logical index partitions. Furthermore, if the ENFORCE option is specified, LOAD will drain the write claim class for the primary key index.

Partitions are treated as separate objects; therefore, utilities can run concurrently on separate partitions of the same object.


Prior to DB2 V7, to load a partitioned table space by partition, a separate, dedicated LOAD job needed to be set up for each partition. Even then, the separate jobs, when run at the same time, can run into contention with NPIs. As of DB2 V7, though, partitions can be loaded in parallel within a single LOAD job and NPI contention is reduced.


An additional benefit of parallel loading is that the input data need not be broken out into separate data sets. And the same goes for the error data set and mapping data set.


Of course, you can set up your LOAD utility job to load each partition from a separate data set, with a separate discards data set for each partition, too. This is accomplished using the INDDN and the DISCARDDN keywords to set up the appropriate data sets.


NOTE

Be aware that the INDDN and DISCARDDN options can only be specified if the PART keyword is also specified. They cannot be used with segmented or simple table spaces.
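A partition-parallel LOAD with separate input and discard data sets per partition might be coded as in this sketch; the table name and DD names are illustrative and must match DD statements allocated in the JCL:

 LOAD DATA LOG NO
 INTO TABLE DSN8810.EMP PART 1 REPLACE
      INDDN SYSREC01 DISCARDDN SYSDSC01
 INTO TABLE DSN8810.EMP PART 2 REPLACE
      INDDN SYSREC02 DISCARDDN SYSDSC02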


The actual number of parallel RELOAD tasks to be run is determined by the number of CPUs, the availability of virtual storage, and the number of available DB2 threads.

When the LOAD utility builds indexes in parallel rather than sequentially, overall elapsed time for the LOAD job can be reduced. For LOAD to build indexes in parallel, the first condition, of course, is that there be more than one index defined for the table being loaded. If that is the case, the SORTKEYS clause must be specified with an estimate for the number of keys, and sort work data sets must be allocated to the LOAD job (either explicitly or dynamically).
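For example, the following sketch requests a parallel index build by supplying a key estimate on the SORTKEYS clause; the estimate of 400,000 keys and the sort work specifications are illustrative:

 LOAD DATA REPLACE LOG NO INDDN SYSREC00
      SORTKEYS 400000
      SORTDEVT SYSDA SORTNUM 4
 INTO TABLE DSN8810.EMP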

Online Loading

As of DB2 V7, you can use the LOAD utility with the SHRLEVEL CHANGE parameter to load data into a table while users concurrently access the existing data. This feature is commonly referred to as an online LOAD resume.


Previously, running the LOAD utility made the table data unavailable. Of course, you could always code a program to insert the data, but that is not very efficient or simple.

Online loading works using normal SQL INSERTs. The LOAD performs normal claim processing, and no drains are taken. All normal SQL processing activities occur, which means that in this case LOAD will cause INSERT triggers to fire and referential constraints to be checked.
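A minimal online LOAD resume statement might look like the following sketch:

 LOAD DATA RESUME(YES) SHRLEVEL CHANGE INDDN SYSREC00
 INTO TABLE DSN8810.DEPT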

LOAD Guidelines

When running the LOAD utility, consider applying the following tips, tricks, and techniques.

Consider Using SORTKEYS

When index keys are not already in sorted order and indexes exist on the table into which data is being loaded, consider using the SORTKEYS keyword. When SORTKEYS is specified, index keys are sorted in memory, rather than being written to work files. This can improve performance by:

  • Eliminating the expensive I/O operations to disk

  • Reducing the space requirements for the SYSUT1 and SORTOUT data sets

  • Reducing elapsed time from the start of the reload phase to the end of the build phase

An estimate of the number of keys to be sorted can be supplied. This estimate is optional, but recommended: if it is not supplied, the extracted keys are written to a work data set, minimizing the efficiency gains of the SORTKEYS parameter. To estimate the number of keys to sort, use the following calculation:

 

 Number of Keys = (Total number of rows to be loaded) x
                  [(number of indexes on the table) +
                   (number of foreign keys {unless an index exists for the FK}) +
                   ((number of foreign keys participating in multiple
                     relationships) x (number of relationships - 1))
                  ]

NOTE

If more than one table is being loaded, the preceding calculation must be repeated for each table; the sum of the results is used.
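As a worked example of this calculation, consider loading 100,000 rows into a table with two indexes, one foreign key that has no supporting index, and one foreign key participating in two relationships (the figures are illustrative):

 Number of Keys = 100,000 x [2 + 1 + (1 x (2 - 1))]
                = 100,000 x 4
                = 400,000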



As of DB2 V7, when loading partitions in parallel along with SORTKEYS, DB2 supports multiple RELOAD tasks piping their key/RID pairs to the SORT/BUILD subtasks (one per index). The number of tasks that can be supported is roughly equal to the number of partitions. But when SORTKEYS is specified and some tasks are allocated for reloading, other tasks must be allocated for sorting index keys and for building indexes in parallel. Thus the number of RELOAD tasks may be reduced to improve the overall performance of the entire LOAD job.


Avoid the LOAD Utility for Tables with Triggers

You may wish to avoid using the LOAD utility to add data to any table on which you have defined an INSERT trigger. Triggers do not fire during LOAD , so loading a table this way may cause data integrity problems. Instead, code a program to insert the data as needed because INSERT will cause the trigger to fire appropriately.

NOTE

Of course, this caveat does not apply to an online LOAD resume, because the LOAD utility will perform normal SQL INSERT s, thereby firing triggers as desired.


Consider Serializing Loads for Tables in the Same Database

The LOAD utility is sensitive to concurrent processing. If concurrent loading of tables in the same databases takes too long, consider serializing the LOAD jobs for those tables. Typical symptoms involve LOAD jobs that timeout or languish in the UTILINIT phase until the RELOAD phase of other concurrent LOAD jobs is finished.

NOTE

Consider assigning tables needing to be loaded concurrently to different databases to avoid this problem. Another approach is to assign only one table per database.


Use LOAD to Append or Replace Rows

You can use LOAD to replace data in a table by specifying the REPLACE option. LOAD also can append new data to a table, leaving current data intact, by specifying the RESUME(YES) option. Choose the appropriate option based on your data loading needs.
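The two options can be contrasted with these minimal sketches, the first replacing all existing rows and the second appending to them:

 LOAD DATA REPLACE INDDN SYSREC00
 INTO TABLE DSN8810.ACT

 LOAD DATA RESUME(YES) INDDN SYSREC00
 INTO TABLE DSN8810.ACT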

Use LOAD to Perform Mass Deletes

Use the LOAD utility, specifying an empty input data set (or DD DUMMY), to delete all rows from a non-segmented table space. This is called a mass delete. LOAD is usually more efficient than DELETE SQL without a WHERE clause. Specifying the LOG NO option to avoid logging data changes will further enhance the performance of the mass delete. Note, however, the following considerations:

  • If multiple tables are assigned to a simple table space, the LOAD utility deletes all rows for all tables in that table space.

  • Consider loading a DUMMY data set even for segmented table spaces if a large amount of data must be deleted. Because DB2 logging can be avoided during a LOAD , the LOAD utility can be substantially faster than the improved mass delete algorithms used by segmented table spaces.
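A mass delete can be sketched as follows, pointing the input DD statement at a dummy data set (the table name is illustrative):

 //DSNUPROC.SYSREC00 DD DUMMY
 //DSNUPROC.SYSIN    DD *
    LOAD DATA REPLACE LOG NO INDDN SYSREC00
    INTO TABLE DSN8810.ACT
 /*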

Use Fixed Blocked Input

To enhance the performance of the LOAD utility, use a fixed blocked input data set rather than a variable blocked data set.

Buffer the Work Data Sets Appropriately

For large loads, set the BUFNO parameter in the JCL for the SYSUT1 DD statement to a number greater than 20. A BUFNO of approximately 20 is recommended for medium-sized indexes, and a BUFNO between 50 and 100 is recommended for larger tables. The BUFNO parameter creates read and write buffers in main storage for the data set, thereby enhancing the performance of the LOAD utility. The default for BUFNO is 8 for DB2 V3 and 20 for DB2 V4.

Ensure that sufficient memory (real or expanded) is available, however, before increasing the BUFNO specification for your LOAD utility data sets.

Enforce RI During Table Loading When Possible

Favor using the ENFORCE option of the LOAD utility to enforce referential constraints instead of running CHECK DATA after the LOAD completes. It is usually more efficient to process the loaded data once, as it is loaded, than to process the data twice, once to load it and once to check it. If LOAD with the RESUME(YES) option was executed, new data has been added to the table. However, if ENFORCE was not specified and a subsequent CHECK DATA is run, CHECK DATA will check the entire table, not just the new data.
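A hedged example, again using the sample table from Listing 33.1:

```
LOAD DATA RESUME(YES) INDDN SYSREC00
   ENFORCE CONSTRAINTS
   INTO TABLE DSN8810.ACT
```

With ENFORCE CONSTRAINTS, rows that violate referential constraints are discarded as they are loaded, so no follow-up CHECK DATA run is needed for the new data.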

Ensure That LOAD Input Data Sets Are in Key Sequence

Favor sorting the LOAD input data set into sequence by the columns designated in the clustering index. Be sure to sort the data in the appropriate sequence, either ascending or descending, depending on how the index was defined. Otherwise, the LOAD utility does not load data in clustering order, and the table space and indexes will be inefficiently organized.

NOTE

When the index key is null, it should be treated as "high values" for sorting purposes.


If you use DFSORT to sort the input data before loading, consider removing duplicate records as part of the sort (using SUM FIELDS=NONE ). Doing so not only can decrease the size of the file passed to the LOAD step, but it can minimize or eliminate LOAD discard processing.

Removing duplicates in DFSORT can improve performance because of the different ways that DFSORT and the LOAD utility handle duplicates. When DFSORT encounters duplicates, it sends one of the values to the output file then discards the remaining duplicates. When the LOAD utility encounters duplicates in the input file, it sends all of the duplicates to the discard file. Consider the following code for DFSORT :

 

 //SYSIN    DD *
   SORT FIELDS=(1,4,BI,A,5,4,BI,A,9,12,CH,A)
   SUM FIELDS=NONE
 /*

This code indicates that DFSORT is to sort on three fields. The first starts in position 1 for 4 bytes, the second starts in position 5 for 4 bytes, and the third starts in position 9 for 12 bytes. The first two fields are unsigned binary, the third is character, and the sort is ascending on each field. Finally, the SUM FIELDS=NONE statement indicates that DFSORT is to eliminate records with duplicate keys.

Additionally, you can improve performance by removing unneeded records during the DFSORT step, instead of using a WHEN clause on the LOAD utility. Doing so can decrease the size of the file passed to the LOAD step and the DFSORT INCLUDE is more efficient than LOAD WHEN . Consider the following sample DFSORT code:

 

 //SYSIN    DD *
   INCLUDE COND=(9,4,CH,EQ,C'CDBD')
 /*

This code indicates that DFSORT is to keep only those records in which the four bytes beginning in position 9 equal 'CDBD'; all other records are dropped before the LOAD step. More details on how to use DFSORT can be found in the IBM manual number SC33-4035, DFSORT Application Programming Guide .

REORG After Loading When the Input Is Not Sorted

If data is not loaded in clustering sequence, consider following the LOAD with a table space reorganization. The reorganization can be run unconditionally after every load, which is not recommended, or conditionally, based on the value of CLUSTER RATIO stored in the DB2 Catalog for the table space and its clustering index. If CLUSTER RATIO is not 100% for a newly loaded table, the REORG utility should be used to cluster and organize the application data.
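The follow-up step can be as simple as the following utility statement (the table space name is hypothetical):

```
REORG TABLESPACE DSN8D81A.DSN8S81B
```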

NOTE

If LOAD is run specifying RESUME(YES) , the result can be a CLUSTER RATIO of less than 100% even if the input is in clustering sequence, because the new rows are appended after the existing data. In this case it is usually best to skip sorting the input: run the LOAD, and then run the REORG utility to cluster and organize the data.


Favor the Use of LOG NO

Use the LOG NO option unless the table to be loaded is very small. Doing so avoids the overhead of logging the loaded data and speeds load processing. If data is loaded without being logged, however, follow the LOAD utility with a full image copy.
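A sketch combining the two steps in one utility input stream (names follow the Listing 33.1 sample; the table space name is hypothetical, and the image copy data set must be allocated to the COPY utility's SYSCOPY DD):

```
LOAD DATA REPLACE INDDN SYSREC00 LOG NO
   INTO TABLE DSN8810.ACT

COPY TABLESPACE DSN8D81A.DSN8S81B
   FULL YES SHRLEVEL REFERENCE
```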

Specify KEEPDICTIONARY for Performance

The LOAD utility will rebuild the compression dictionary for table spaces defined with the COMPRESS YES parameter. Specifying the KEEPDICTIONARY parameter causes the LOAD utility to bypass dictionary rebuilding. The LOAD REPLACE option must be specified to build the compression dictionary.

This will improve the overall performance of the LOAD utility because the CPU cycles used to build the dictionary can be avoided. However, this option should be utilized only when you are sure that the same basic type of data is being loaded into the table. If the type of data differs substantially, allowing the LOAD utility to rebuild the compression dictionary will provide for more optimal data compression.
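For example (sample table from Listing 33.1):

```
LOAD DATA REPLACE KEEPDICTIONARY
   INDDN SYSREC00 LOG NO
   INTO TABLE DSN8810.ACT
```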

NOTE

Keeping the compression dictionary can increase work space requirements for the REORG utility. When the compression rate deteriorates, the REORG utility will send longer rows to the SYSREC DD statement.


Avoid Nullable Columns for Frequently Loaded Tables

Loading tables with nullable columns can degrade the LOAD utility's performance. If a table will be loaded frequently (daily, for example), consider reducing or eliminating the number of nullable columns defined to the table to increase the performance of the LOAD utility. This is not always practical or desirable because many program changes may be required to change columns from nullable to NOT NULL or to NOT NULL WITH DEFAULT . Additionally, nullable columns might make more sense than default values given the specification of the application.

Avoid Decimal Columns for Frequently Loaded Tables

Avoid DECIMAL columns for tables that are loaded frequently. Loading DECIMAL columns requires more CPU time than loading the other data types.

Avoid Data Conversion

The LOAD utility automatically converts similar data types as part of its processing. However, try to avoid data conversion, because the LOAD utility requires additional CPU time to process these conversions.

The following data conversions are performed automatically by the LOAD utility:

Original Data Type    Converted Data Type
SMALLINT              INTEGER, DECIMAL, FLOAT
INTEGER               SMALLINT, DECIMAL, FLOAT
DECIMAL               SMALLINT, INTEGER, FLOAT
FLOAT                 SMALLINT, INTEGER, DECIMAL
CHAR                  VARCHAR, LONG VARCHAR
VARCHAR               CHAR, LONG VARCHAR
GRAPHIC               VARGRAPHIC, LONG VARGRAPHIC
VARGRAPHIC            GRAPHIC, LONG VARGRAPHIC
TIMESTAMP EXT         DATE, TIME, TIMESTAMP


Reduce CPU Usage by Explicitly Coding All LOAD Parameters

Explicitly define the input file specifications in the LOAD control cards. Do this even when the data set to be loaded conforms to all the default lengths specified in Table 33.1. This reduces the LOAD utility's CPU use.

Table 33.1. Default LOAD Lengths

Column Data Type      Default Length
SMALLINT              2
INTEGER               4
DECIMAL               Column's precision
REAL                  4
DOUBLE PRECISION      8
DATE                  10
TIME                  8
TIMESTAMP             26
CHAR                  Column's length
VARCHAR               Column's maximum length
GRAPHIC               Double the column's length
VARGRAPHIC            Double the column's maximum length
ROWID                 Varies
BLOB , CLOB , DBCLOB  Varies


If the input file specifications are not explicitly identified, the LOAD utility assumes that the input data set is formatted with the defaults specified in Table 33.1.

NOTE

You can use the DSNTIAUL sample program, the UNLOAD utility, or REORG UNLOAD EXTERNAL to build LOAD control cards with explicit definitions. The PUNCHDDN keyword is used to specify a data set for the control cards.


For BLOB , CLOB , and DBCLOB data, you must specify the length of the input field in bytes. This length value is placed in a four-byte binary field at the beginning of the LOB value. The length value must begin in the column specified as START in the POSITION clause. The END specification is not used for LOB s.

Consider Using LOAD Parameters to Edit Data Before Loading

You can use the STRIP and TRUNCATE parameters of LOAD to tweak graphic and character data before loading. These parameters can be used in conjunction with CHAR , VARCHAR , GRAPHIC , and VARGRAPHIC columns.

The STRIP parameter indicates that LOAD must remove specified characters from the beginning, the end, or both ends of the data prior to loading it. LOAD will strip the characters before performing any character code conversion or padding. If a specific character is not coded, the default is to strip blanks. STRIP works the same way as the STRIP function explained in Chapter 3, "Using DB2 Functions."

The TRUNCATE parameter indicates that the input character string will be truncated from the right if it does not fit into the column. LOAD will truncate the data after any required CCSID translation.
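A sketch based on the sample table from Listing 33.1 (the choice of which columns get STRIP and TRUNCATE is illustrative):

```
LOAD DATA REPLACE INDDN SYSREC00
   INTO TABLE DSN8810.ACT
      (ACTNO    POSITION ( 1 )  SMALLINT,
       ACTKWD   POSITION ( 3 )  CHAR ( 6 ) TRUNCATE,
       ACTDESC  POSITION ( 9 )  VARCHAR STRIP BOTH TRUNCATE)
```

Here ACTKWD values longer than six characters are truncated from the right, and ACTDESC values have blanks removed from both ends before loading.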

Create All Indexes Before Loading

It is usually more efficient to define all indexes before using the LOAD utility. The LOAD utility uses an efficient algorithm to build DB2 indexes.

If indexes must be created after the data has been loaded, create the indexes with the DEFER YES option and build them later using the REBUILD INDEX utility.
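For example, an index might be defined with DEFER YES and built afterward (the index name is hypothetical):

```
CREATE INDEX DSN8810.XACT2
   ON DSN8810.ACT (ACTKWD ASC)
   DEFER YES;
```

After the data is loaded, running REBUILD INDEX (DSN8810.XACT2) populates the index in a single pass over the data.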

Favor LOAD over INSERT

To insert initial data into a DB2 table, favor the use of the LOAD utility with the REPLACE option over an application program coded to process INSERT s. LOAD should be favored even if the application normally processes INSERT s as part of its design. The initial loading of DB2 table data usually involves the insertion of many more rows than does typical application processing. For the initial population of table data, the LOAD utility is generally more efficient and less error-prone than a corresponding application program, and also maintains free space.

Consider using the LOAD utility with the RESUME(YES) option to process a large volume of table insertions. LOAD is usually more efficient and less error-prone than a corresponding application program that issues a large number of INSERT s.

Do Not Load Tables in a Multi-Table Simple Table Space

Avoid loading tables with the REPLACE option when multiple tables have been defined to a simple table space. The LOAD utility with the REPLACE option deletes all rows in all tables in the simple table space, which is not usually the desired result.

Gather Statistics When Loading Data

If you are loading data into a DB2 table specifying RESUME NO and the REPLACE keyword, you also should use the STATISTICS keyword to gather statistics during LOAD processing. These keywords specify that you are loading a table from scratch and that any previous data will be lost. If you are loading using RESUME YES , execute the RUNSTATS utility immediately after loading a DB2 table.

Accurate statistics are necessary to maintain current information about your table data for access path determination. Of course, access paths for static SQL will not change unless all packages and plans accessing the table are rebound. Any dynamic SQL statements will immediately take advantage of the new statistics.
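For example, inline statistics can be gathered during a replacement load (sample table from Listing 33.1):

```
LOAD DATA REPLACE INDDN SYSREC00 LOG NO
   STATISTICS TABLE(ALL) INDEX(ALL)
   INTO TABLE DSN8810.ACT
```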

Consider Loading by Partition

Concurrent loading of multiple partitions of a single table space can be achieved using partition independence. This technique is useful for reducing the overall elapsed time of loading a table in a partitioned table space.
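A sketch, assuming the partitioned DSN8810.EMP sample table and one input data set per partition (the DDNAME s are illustrative); each statement would typically run in its own job so that the partitions load concurrently:

```
LOAD DATA INDDN SYSREC01 LOG NO
   INTO TABLE DSN8810.EMP PART 1 REPLACE

LOAD DATA INDDN SYSREC02 LOG NO
   INTO TABLE DSN8810.EMP PART 2 REPLACE
```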

Use Data Contingency Options As Required

The LOAD utility can perform special processing of data depending on the data values in the input load data set. Data contingency processing parameters indicate a field defined in the LOAD parameters or a beginning and ending location of data items to be checked. The data contingency processing parameters follow:

NULLIF

Sets column values to null if a particular character string is found at a particular location. For example:

 NULLIF (22) = '?' 

DEFAULTIF

Sets column values to a predefined default value if a particular character string is found at a particular location. For example:

 DEFAULTIF FIELD = 'DEFLT' 

WHEN

Limits the loaded data to specific records in the load input data set. For example:

 LOAD DATA REPLACE INTO DSN8510.DEPT WHEN (1 : 3) = 'A00' 

CONTINUEIF

Used when there are record types in the input load data set. Specifies that loading will continue, logically concatenating the next record to the previous input record. For example:

 LOAD DATA INTO DSN8510.EMP CONTINUEIF (10 : 10) = 'X' 


CAUTION

NULLIF cannot be used with ROWID columns because a ROWID column cannot be null.


Separate Work Data Sets

Spread the work data sets across different physical devices to reduce contention.

Use Caution When Loading ROWID Data

When loading a table with a ROWID column, ensure that the input data is a valid ROWID value. The appropriate way to do this is to load ROWID values that were previously generated by DB2.

If the ROWID is defined with the GENERATED ALWAYS keyword, you cannot load data into that column. Instead, the ROWID value must be generated by DB2.

Handle Floating Point Data

Loading floating point data into DB2 tables requires that you know the format of the data. Two options are available for loading floating point data:

S390

Floating point data is specified in System/390 hexadecimal floating point format. This is the default value. It is also the format in which DB2 stores floating point numbers.

IEEE

Floating point data is specified in IEEE binary floating point format. DB2 expects to find the input numbers in binary floating point format and will convert the data to hexadecimal floating point format as the data is loaded.


NOTE

If a conversion error occurs while converting from binary floating point format to hexadecimal floating point format, DB2 will place the record in the discard file.


Optimize Sort Utility Processing

Be sure to optimize the operation of the sort utility in use at your shop. For example, you can assign additional resources to DFSORT using the following DD statement:

 

 //DFSPARM  DD *
   HIPRMAX=0,EXCPVR=NONE,SIZE=32768K

Additionally, consider using the SORTNUM clause to increase the number of dynamically allocated files and use the SORTDEVT clause to assign the disk pool.
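For example (the device type and file count are illustrative):

```
LOAD DATA REPLACE INDDN SYSREC00
   SORTDEVT SYSDA SORTNUM 4
   INTO TABLE DSN8810.ACT
```

SORTDEVT names the device type for dynamically allocated sort work data sets, and SORTNUM specifies how many of them to allocate.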

Be Aware of the Impact of Multi-Level Security on LOAD

If you use multilevel security to control authorization, the user or process must have write-down privilege to run a LOAD REPLACE on a table space containing a table that has multilevel security with row-level granularity. This allows you to specify values for the security label columns.

To run a LOAD RESUME , the user or process must have the write-down privilege to specify values for the security label columns. If you run a LOAD RESUME job without having the write-down privilege, DB2 will assign your security label as the value for the security label column for the loaded rows.

DB2 Developer's Guide (5th Edition)
ISBN: 0672326132
Year: 2004
Pages: 388