Writing Date-Processing Utilities | Importing and Exporting Data

10.32.1 Problem

There's a given date-processing operation that you need to perform frequently, so you want to write a utility that does it for you.

10.32.2 Solution

The utilities in this section provide some examples showing how to do that.

10.32.3 Discussion

Due to the idiosyncratic nature of dates, you'll probably find it necessary to write date converters from time to time. This section shows some sample converters that serve various purposes:

isoize_date.pl reads a file looking for dates in U.S. format (MM-DD-YY) and converts them to ISO format.
cvt_date.pl converts dates to and from any of ISO, US, or British formats. It is more general than isoize_date.pl, but requires that you tell it what kind of input to expect and what kind of output to produce.
monddccyy_to_iso.pl looks for dates like Feb. 6, 1788 and converts them to ISO format. It illustrates how to map dates with non-numeric parts to a format that MySQL will understand.

All three scripts are located in the transfer directory of the recipes distribution. They assume datafiles are in tab-delimited, linefeed-terminated format. (Use cvt_file.pl to work with files in a different format.)

Our first date-processing utility, isoize_date.pl, looks for dates in U.S. format and rewrites them into ISO format. You'll recognize that it's modeled after the general input-processing loop shown in Recipe 10.21, with some extra stuff thrown in to perform a specific type of conversion:

#! /usr/bin/perl -w
# isoize_date.pl - Read input data, look for values that match
# a date pattern, convert them to ISO format. Also converts
# 2-digit years to 4-digit years, using a transition point of 70.

# By default, this looks for dates in MM-DD-[CC]YY format.

# Assumes tab-delimited, linefeed-terminated input lines.

# Does not check whether dates actually are valid (for example,
# won't complain about 13-49-1928).

use strict;

# transition point at which 2-digit years are assumed to be 19XX
# (below they are treated as 20XX)
my $transition = 70;

while (<>)
{
 chomp;
 my @val = split (/	/, $_, 10000); # split, preserving all fields
 for my $i (0 .. @val - 1)
 {
 my $val = $val[$i];
 # look for strings in MM-DD-[CC]YY format
 next unless $val =~ /^(d{1,2})D(d{1,2})D(d{2,4})$/;

 my ($month, $day, $year) = ($1, $2, $3);
 # to interpret dates as DD-MM-[CC]YY instead, replace preceding
 # line with the following one:
 #my ($day, $month, $year) = ($1, $2, $3);

 # convert 2-digit years to 4 digits, then update value in array
 $year += ($year >= $transition ? 1900 : 2000) if $year < 100;
 $val[$i] = sprintf ("%04d-%02d-%02d", $year, $month, $day);
 }
 print join ("	", @val) . "
";
}

exit (0);

If you feed isoize_date.pl an input file that looks like this:

Fred 04-13-70
Mort 09-30-69
Brit 12-01-57
Carl 11-02-73
Sean 07-04-63
Alan 02-14-65
Mara 09-17-68
Shepard 09-02-75
Dick 08-20-52
Tony 05-01-60

It produces the following output:

Fred 1970-04-13
Mort 2069-09-30
Brit 2057-12-01
Carl 1973-11-02
Sean 2063-07-04
Alan 2065-02-14
Mara 2068-09-17
Shepard 1975-09-02
Dick 2052-08-20
Tony 2060-05-01

isoize_date.pl serves a specific purpose: It converts only from U.S. to ISO format. It does not perform validity checking on date subparts or allow the transition point for adding the century to be specified. A more general tool would be more useful. The next script, cvt_date.pl, extends the capabilities of isoize_date.pl; it recognizes input dates in ISO, US, or British formats and converts any of them to any other. It also can convert two-digit years to four digits, allows you to specify the conversion transition point, and can warn about bad dates. As such, it can be used to preprocess input for loading into MySQL, or for postprocessing data exported from MySQL for use by other programs.

cvt_date.pl understands the following options:

--iformat= format, --oformat= format, --format= format,

Set the date format for input, output, or both. The default format value is iso; cvt_date.pl also recognizes any string beginning with us or br as indicating U.S. or British date format.

--add-century

Convert two-digit years to four digits.

--columns= column_list

Convert dates only in the named columns. By default, cvt_date.pl looks for dates in all columns. If this option is given, column_list should be a list of one or more column positions separated by commas. Positions begin at 1.

--transition= n

Specify the transition point for two-digit to four-digit year conversions. The default transition point is 70. This option turns on --add-century.

--warn

Warn about bad dates. (Note that this option can produce spurious warnings if the dates have two-digit years and you don't specify --add-century, because leap year testing won't always be accurate in that case.)

I won't show the code for cvt_date.pl here (most of it is taken up with processing command-line options), but you can examine the source for yourself if you like. As an example of how cvt_date.pl works, suppose you have a file newdata.txt with the following contents:

name1 01/01/99 38
name2 12/31/00 40
name3 02/28/01 42
name4 01/02/03 44

Running the file through cvt_date.pl with options indicating that the dates are in U.S. format and that the century should be added produces this result:

% cvt_date.pl --iformat=us --add-century newdata.txt
name1 1999-01-01 38
name2 2000-12-31 40
name3 2001-02-28 42
name4 2003-01-02 44

To produce dates in British format instead with no year conversion, do this:

% cvt_date.pl --iformat=us --oformat=br newdata.txt
name1 01-01-99 38
name2 31-12-00 40
name3 28-02-01 42
name4 02-01-03 44

cvt_date.pl has no knowledge of the meaning of each data column, of course. If you have a non-date column with values that match the pattern, it will rewrite that column, too. To deal with that, specify a --columns option to limit the columns that cvt_date.pl attempts to convert.

isoize_date.pl and cvt_date.pl both operate on dates written in all-numeric formats. But dates in datafiles often are written differently, in which case it may be necessary to write a special purpose script to process them. Suppose an input file contains dates in the following format (these represent the dates on which U.S. states were admitted to the Union):

Delaware Dec. 7, 1787
Pennsylvania Dec 12, 1787
New Jersey Dec. 18, 1787
Georgia Jan. 2, 1788
Connecticut Jan. 9, 1788
Massachusetts Feb. 6, 1788
Maryland Apr. 28, 1788
South Carolina May 23, 1788
New Hampshire Jun. 21, 1788
Virginia Jun 25, 1788
...

The dates consist of a three-character month abbreviation (possibly followed by a period), the numeric day of the month, a comma, and the numeric year. To import this file into MySQL, you'd need to convert the dates to ISO format, resulting in a file that looks like this:

Delaware 1787-12-07
Pennsylvania 1787-12-12
New Jersey 1787-12-18
Georgia 1788-01-02
Connecticut 1788-01-09
Massachusetts 1788-02-06
Maryland 1788-04-28
South Carolina 1788-05-23
New Hampshire 1788-06-21
Virginia 1788-06-25
...

That's a somewhat specialized kind of transformation, though this general type of problem (converting a specific date format) is hardly uncommon. To perform the conversion, identify the dates as those values matching an appropriate pattern, map month names to the corresponding numeric values, and reformat the result. The following script, monddccyy_to_iso.pl, illustrates how to do this:

#! /usr/bin/perl -w
# monddccyy_to_iso.pl - convert dates from mon[.] dd, ccyy to ISO format

# Assumes tab-delimited, linefeed-terminated input

use strict;

my %map = # map 3-char month abbreviations to numeric month
(
 "jan" => 1, "feb" => 2, "mar" => 3, "apr" => 4, "may" => 5, "jun" => 6,
 "jul" => 7, "aug" => 8, "sep" => 9, "oct" => 10, "nov" => 11, "dec" => 12
);

while (<>)
{
 chomp;
 my @val = split (/	/, $_, 10000); # split, preserving all fields
 for my $i (0 .. @val - 1)
 {
 # reformat the value if it matches the pattern, otherwise assume
 # it's not a date in the required format and leave it alone
 if ($val[$i] =~ /^([^.]+).? (d+), (d+)$/)
 {
 # use lowercase month name
 my ($month, $day, $year) = (lc ($1), $2, $3);
 if (exists ($map{$month}))
 {
 $val[$i] = sprintf ("%04d-%02d-%02d",
 $year, $map{$month}, $day);
 }
 else
 {
 # warn, but don't reformat
 warn "$val[$i]: bad date?
";
 }
 }
 }
 print join ("	", @val) . "
";
}

exit (0);

The script only does reformatting, it doesn't validate the dates. To do that, modify the script to use the Cookbook_Utils.pm module by adding this statement after the use strict line:

use Cookbook_Utils;

That gives the script access to the module's is_valid_date( ) routine. To use it, change the reformatting section of the script to look like this:

if (exists ($map{$month})
 && is_valid_date ($year, $map{$month}, $day))
{
 $val[$i] = sprintf ("%04d-%02d-%02d",
 $year, $map{$month}, $day);
}
else
{
 # warn, but don't reformat
 warn "$val[$i]: bad date?
";
}

Using the mysql Client Program

Writing MySQL-Based Programs

Record Selection Techniques

Working with Strings

Working with Dates and Times

Sorting Query Results

Generating Summaries

Modifying Tables with ALTER TABLE

Obtaining and Using Metadata

Importing and Exporting Data

Generating and Using Sequences

Using Multiple Tables

Statistical Techniques

Handling Duplicates

Performing Transactions

Introduction to MySQL on the Web

Incorporating Query Resultsinto Web Pages

Processing Web Input with MySQL

Using MySQL-Based Web Session Management

Appendix A. Obtaining MySQL Software

Appendix B. JSP and Tomcat Primer

Appendix C. References