Recipe 10.21. Validating and Transforming Data

Problem

You need to make sure that the data values contained in a file are legal.

Solution

Check them, possibly rewriting them into a more suitable format.

Discussion

Earlier recipes in this chapter show how to work with the structural characteristics of files, by reading lines and breaking them up into separate columns. It's important to be able to do that, but sometimes you need to work with the data content of a file, not just its structure:

It's often a good idea to validate data values to make sure they're legal for the data types into which you're storing them. For example, you can make sure that values intended for INT, DATE, and ENUM columns are integers, dates in CCYY-MM-DD format, and legal enumeration values, respectively.
Data values may need reformatting. Rewriting dates from one format to another is especially common; for example, if a program writes dates in MM-DD-YY format to ISO format for import into MySQL. If a program understands only date and time formats and not a combined date-and-time format (such as MySQL uses for the DATETIME and TIMESTAMP data types), you need to split date-and-time values into separate date and time values.
It may be necessary to recognize special values in the file. It's common to represent NULL with a value that does not otherwise occur in the file, such as -1, Unknown, or N/A. If you don't want those values to be imported literally, you need to recognize and handle them specially.

This is the first of a set of recipes that describe validation and reformatting techniques that are useful in these kinds of situations. Techniques covered here for checking values include pattern matching and validation against information in a database. It's not unusual for certain validation operations to come up over and over, in which case you'll probably find it useful to to construct a library of functions. By packaging validation operations as library routines, it is easier to write utilities based on them, and the utilities make it easier to perform command-line operations on entire files so that you can avoid editing them yourself.

Server-Side Versus Client-Side Validation

As described in Section 10.20, you can cause data validation to be done on the server side by setting the SQL mode to be restrictive about accepting bad input data. In this case, the MySQL server raises an error for values that aren't legal for the data types of the columns into which you insert them.

In the next few sections, the focus is validation on the client side rather than on the server side. Client-side validation can be useful when you require more control over validation than simply receiving an error from the server. (For example, if you test values yourself, it's often easier to provide more informative messages about the exact nature of problems with the values.) Also, it might be necessary to couple validation with reformatting to transform complex values so that they are compatible with MySQL data types. You have more flexibility to do this on the client side.

If you want to avoid writing your own library routines, look around to see if someone else has already written suitable routines that you can use. For example, if you check the Perl CPAN (cpan.perl.org), you'll find a Data::Validate module hierarchy. The modules there provide library routines that standardize a number of common validation tasks. Data::Validate::MySQL deals specifically with MySQL data types.

Writing an input-processing loop

Many of the validation recipes shown in the new few sections are typical of those that you perform within the context of a program that reads a file and checks individual column values. The general framework for such a file-processing utility can be written like this:

#!/usr/bin/perl # loop.pl - Typical input-processing loop # Assumes tab-delimited, linefeed-terminated input lines. use strict; use warnings; while (<>)                # read each line {   chomp;   # split line at tabs, preserving all fields   my @val = split (/\t/, $_, 10000);   for my $i (0 .. @val - 1) # iterate through columns in line   {     # ... test $val[$i] here ...   } }

The while⁠(⁠ ⁠ ⁠) loop reads each input line and breaks it into fields. Inside the loop, each line is broken into fields. Then the inner for⁠(⁠ ⁠ ⁠) loop iterates through the fields in the line, allowing each to be processed in sequence. If you're not applying a given test uniformly to all the fields, replace the for⁠(⁠ ⁠ ⁠) loop with separate column-specific tests.

This loop assumes tab-delimited, linefeed-terminated input, an assumption that is shared by most of the utilities discussed throughout the rest of this chapter. To use these programs with datafiles in other formats, you may be able to convert the files into tab-delimited format using the cvt_file.pl script discussed in Section 10.18.

Putting common tests in libraries

For a test that you perform often, it may be useful to package it as a library function. This makes the operation easy to perform and also gives it a name that's likely to make the meaning of the operation clearer than the comparison code itself. For example, the following test performs a pattern match to check that $val consists entirely of digits (optionally preceded by a plus sign), and then makes sure the value is greater than zero:

$valid = ($val =~ /^\+?\d+$/ && $val > 0);

In other words, the test looks for strings that represent positive integers. To make the test easier to use and its intent clearer, you might put it into a function that is used like this:

$valid = is_positive_integer ($val);

The function itself can be defined as follows:

sub is_positive_integer { my $s = $_[0];   return ($s =~ /^\+?\d+$/ && $s > 0); }

Now put the function definition into a library file so that multiple scripts can use it easily. The Cookbook_Utils.pm module file in the lib directory of the recipes distribution is an example of a library file that contains a number of validation functions. Take a look through it to see which functions may be useful in your own programs (or as a model for writing your own library files). To gain access to this module from within a script, include a use statement like this:

use Cookbook_Utils;

You must of course install the module file in a directory where Perl will find it. For details on library installation, see Section 2.3.

A significant benefit of putting a collection of utility routines into a library file is that you can use it for all kinds of programs. It's rare for a data manipulation problem to be completely unique. If you can pick and choose at least a few validation routines from a library, it's possible to reduce the amount of code you need to write, even for highly specialized programs.