A LOAD DATA Diagnostic Utility | Importing and Exporting Data

10.38.1 Problem

LOAD DATA or mysqlimport indicates a nonzero warning count when you load a datafile into MySQL, but you have no idea which rows or columns were problematic.

10.38.2 Solution

Run the file through a utility that diagnoses which data values caused the warnings.

10.38.3 Discussion

As a bulk loader, LOAD DATA is very efficient; it can run many times faster than a set of INSERT statements that adds the same rows. However, LOAD DATA also is not very informative. It returns only a message that indicates the number of records processed, and a few other status counts. For example, in the previous section, we generated a datafile managers.txt to use with guess_table.pl for guessing the structure of the baseball1.com managers table. If you create that table using the resulting CREATE TABLE statement and then load the datafile into it, you will observe the following result:

mysql> LOAD DATA LOCAL INFILE 'managers.txt' INTO TABLE managers
 -> IGNORE 1 LINES;
Query OK, 2841 rows affected (0.06 sec)
Records: 2841 Deleted: 0 Skipped: 0 Warnings: 5082

Evidently, there were a quite a few problems with the file. Unfortunately, the message produced by LOAD DATA doesn't tell you anything about which rows and columns caused them. The mysqlimport program is similarly terse, because its message is the same as the one returned by LOAD DATA.

We'll revisit this example at the end of the section, but first consider LOAD DATA's output style. On the one hand, the minimal-report approach is the right one to take. If warning information were to be returned to the client, it potentially could include a diagnostic message for each input row, or even for each column! This might be overwhelming and certainly would entirely defeat the high-efficiency nature of LOAD DATA. On the other hand, more information about the source of errors could be useful for fixing the file to eliminate the warnings.

It's on the MySQL development "to do" list to allow LOAD DATA errors to be logged to another table so that you can get extended diagnostic information. In the meantime, you can use the load_diag.pl utility included in the transfer directory of the recipes distribution. load_diag.pl is useful for "pre-flighting" a datafile to get an idea of how well the file will load into the table you intend it for, and to pinpoint problems so that you can clean up the file before loading it into MySQL "for real."

load_diag.pl also can help you identify patterns of problems for situations in which it may be beneficial to write a preprocessing filter. Suppose you periodically receive files containing data to be loaded into a given MySQL table. The more frequently this occurs, the more highly motivated you'll be to automate as much of the data transfer process as possible. This may involve writing a filter to convert data values from the format in which you receive them to a format more appropriate for MySQL. Running the datafiles through load_diag.pl can help you assess which columns tend to be problematic and thereby assist you in determining where to concentrate your efforts in creating a transformation program for rewriting the files so they will load cleanly into MySQL.

To run load_diag.pl, specify the name of the database and table you intend to load the datafile into, as well as the name of the file itself:

% load_diag.pl  db_name  tbl_name  file_name

load_diag.pl won't actually load anything into the table named on the command line, but it needs to know what the table is so that it can create a temporary table that has the same column structure to use for testing.

Initially, load_diag.pl loads the entire datafile into the temporary table to see if there are any warnings. If not, there's nothing else to do, so load_diag.pl drops the temporary table and exits. Otherwise, it loads each line of the datafile into the table individually to determine which lines caused problems, using the following procedure:

It writes the line to a temporary file and issues a LOAD DATA statement to load the file into the table. If the warning count is zero, the line is assumed to be okay.
If the warning count for the line is nonzero, load_diag.pl examines each of its columns in turn by using a series of single-column LOAD DATA statements to find out which ones generate warnings.
If a column-specific warning occurs and the data value is empty, load_diag.pl determines whether the warning goes away by loading a NULL value instead. It does this because if a datafile contains empty values, you can often get better results by loading NULL than by loading empty strings. (For example, if you load an empty string into an INT column, MySQL converts the value to 0 and issues a warning.) If a datafile turns out to have a significant reduction in warnings when loading NULL rather than empty strings, you may find it useful to run the file through to_null.pl before loading it.
It's also possible for warnings to occur if a line contains fewer or more columns than the number of columns in the table, so load_diag.pl checks that, too.

load_diag.pl prints diagnostic information about its findings while testing each input line, then prints a summary report after the entire file has been processed. The report indicates the number of lines in the file, how many warnings the initial full-file load caused, and the number of lines that had too few or too many columns. The report also includes a list that shows for each column how many values were missing, the number of warnings that occurred, how many of those warnings occurred for empty values, and the number of empty-value warnings that went away by loading NULL instead.

As you might guess, all this activity means that load_diag.pl isn't nearly as efficient as LOAD DATA. In fact, it has the potential to exercise your server rather heavily! But its goal is to provide maximal information, not minimal execution time. (Note too that if your MySQL server has logging enabled, using load_diag.pl with large datafiles can cause the logs to grow quickly.)

To see how load_diag.pl works, assume you have a simple table named diag_test that contains string, date, and number columns:

CREATE TABLE diag_test
(
 str CHAR(10),
 date DATE,
 num INT
);

Assume you also have a datafile named diag_sample.dat that you plan to load into the table:

str1 01-20-2001 97
str2 02-28-2002
 03-01-2002 64 extra junk

To see if the file will have any problems loading, check it like this:

% load_diag.pl cookbook diag_test diag_sample.dat
line 1: 1 warning
 column 2 (date): bad value = (01-20-2001)
line 2: 2 warnings
 too few columns
 column 2 (date): bad value = (02-28-2002)
 column 3 (num): missing from input line
 column 3 (num): bad value = ( ) (inserting NULL worked better)
line 3: 1 warning
 excess number of columns

Number of lines in file: 3
Warnings found when loading entire file: 4
Lines containing too few column values: 1
Lines containing excess column values: 1

Warnings per column:

Column Times Total Warnings for Improved
 missing warnings empty columns with NULL
str 0 0 0 0
date 0 2 0 0
num 1 1 1 1

It appears that the dates don't load very well. That's not surprising, because they appear to be in U.S. format and should be rewritten in ISO format. Converting empty fields to N may also be beneficial, and you can get rid of the extra column value in line 3. Using some of the utilities developed earlier in this chapter, perform all those transformations, writing the result to a temporary file:

% yank_col.pl --columns=1-3 diag_sample.dat 
 | cvt_date.pl --iformat=us --oformat=iso 
 | to_null.pl > tmp

The tmp file produced by that command looks like this:

str1 2001-01-20 97
str2 2002-02-28 N
N 2002-03-01 64

Using load_diag.pl to check the new file produces the following result:

% load_diag.pl cookbook diag_test tmp
File loaded with no warnings, no per-record tests performed

This indicates that if you load tmp into the diag_test table, you should get good results, and indeed that is true:

mysql> LOAD DATA LOCAL INFILE 'tmp' INTO TABLE diag_test;
Records: 3 Deleted: 0 Skipped: 0 Warnings: 0

Clearly, that's a lot of messing around just to make a three-line file load into MySQL better. But the point of the example is to illustrate that the feedback load_diag.pl provides can help you figure out what's wrong with a datafile so that you can clean it up.

In addition to the required arguments that name the database, table, and datafile, load_diag.pl understands several options:

--columns= name1,name2,name3,...

By default, load_diag.pl assumes the datafile contains columns that correspond in number and order to the columns in the table. If that is not true, use this option to specify the names of the columns that are present in the file, and in what order.

--labels

This option indicates that the datafile contains an initial row of labels that should be skipped. (Loading labels into a table typically results in spurious warnings.)

--skip-full-load

Skip the initial test that loads the entire datafile.

--tmp-table= tbl_name

Specify the name to use for the temporary table. The default is _load_diag_n, where n is load_diag.pl's process ID.

If necessary, you can also specify standard connection parameter options like --user or --host. Any options must precede the database name argument.

Use of load_diag.pl is subject to the following constraints and limitations:

The input must be in tab-delimited, linefeed-terminated format.
Record loading is performed with the LOCAL option of the LOAD DATA statement. LOCAL capability requires MySQL 3.22.15 or higher (and, as of 3.23.49, requires that your MySQL distribution not have been built with that capability disabled).
When load_diag.pl creates the temporary table, it omits any indexes that are present in the original table. This results in faster record loading time (particularly for the initial test that loads the entire datafile). On the other hand, not using indexes means that load_diag.pl won't find warnings that result from duplicate key values on unique indexes.

Returning to the example with which this section began, what about all those warnings that resulted from loading the managers.txt file into the managers table? load_diag.pl identifies them all as being due to missing or empty columns at the end of some of the lines:

% load_diag.pl --labels cookbook managers managers.txt
line 2: 2 warnings
 column 14 (postwins): bad value = ( ) (inserting NULL worked better)
 column 15 (postlosses): bad value = ( ) (inserting NULL worked better)
line 3: 2 warnings
 column 14 (postwins): bad value = ( ) (inserting NULL worked better)
 column 15 (postlosses): bad value = ( ) (inserting NULL worked better)
...
line 2839: 2 warnings
 column 14 (postwins): bad value = ( ) (inserting NULL worked better)
 column 15 (postlosses): bad value = ( ) (inserting NULL worked better)
line 2842: 2 warnings
 column 14 (postwins): bad value = ( ) (inserting NULL worked better)
 column 15 (postlosses): bad value = ( ) (inserting NULL worked better)

Number of lines in file: 2842
Warnings found when loading entire file: 5082
Lines containing too few column values: 416
Lines containing excess column values: 0

Warnings per column:

Column Times Total Warnings for Improved
 missing warnings empty columns with NULL
lahmanid 0 0 0 0
year 0 0 0 0
team 0 0 0 0
lg 0 0 0 0
div 0 0 0 0
g 0 0 0 0
w 0 0 0 0
l 0 0 0 0
pct 0 0 0 0
std 0 0 0 0
half 0 0 0 0
mgrorder 0 0 0 0
plyrmgr 16 0 0 0
postwins 416 2533 2533 2533
postlosses 416 2533 2533 2533

From this result, we can determine that 416 lines were missing the postwins and postlosses columns (and 16 of those were missing the plyrmgr column as well). The remaining errors were due to lines for which the postwins and postlosses columns were present but empty. The entire-file warning count of 5082 can be accounted for as the the number of plyrmgr values that were missing, plus the total warnings from the postwins and postlosses columns (16+2533+2533 = 5082).

The Total warnings value for the plyrmgr column is zero because it's a CHAR column, and thus loading empty values into it is legal. The Total warnings value for postwins and postlosses is nonzero because they are INT columns and loading empty values into them result in a conversion-to-zero operations. All of these problems are of the sort that can be made to go away by converting empty or missing values to N. Run the file through yank_col.pl to force each line to have 15 columns, and run the result through to_null.pl to convert empty values to N:

% yank_col.pl --columns=1-15 managers.txt | to_null.pl > tmp

Then see what load_diag.pl has to say about the resulting file:

% load_diag.pl --labels cookbook managers tmp
File loaded with no warnings, no per-record tests performed

If you load tmp into the managers table, no problems should occur:

mysql> LOAD DATA LOCAL INFILE 'tmp' INTO TABLE managers IGNORE 1 LINES;
Query OK, 2841 rows affected (0.13 sec)
Records: 2841 Deleted: 0 Skipped: 0 Warnings: 0

Using the mysql Client Program

Writing MySQL-Based Programs

Record Selection Techniques

Working with Strings

Working with Dates and Times

Sorting Query Results

Generating Summaries

Modifying Tables with ALTER TABLE

Obtaining and Using Metadata

Importing and Exporting Data

Generating and Using Sequences

Using Multiple Tables

Statistical Techniques

Handling Duplicates

Performing Transactions

Introduction to MySQL on the Web

Incorporating Query Resultsinto Web Pages

Processing Web Input with MySQL

Using MySQL-Based Web Session Management

Appendix A. Obtaining MySQL Software

Appendix B. JSP and Tomcat Primer

Appendix C. References