Recipe 10.31. Writing Date-Processing Utilities


Problem

There's a given date-processing operation that you need to perform frequently, so you want to write a utility that does it for you.

Solution

The utilities in this recipe provide some examples that show how to do that.

Discussion

Due to the idiosyncratic nature of dates, you might find it necessary to write date converters from time to time. This section shows some sample converters that serve various purposes:

  • isoize_date.pl reads a file looking for dates in U.S. format (MM-DD-YY) and converts them to ISO format.

  • cvt_date.pl converts dates to and from any of ISO, U.S., or British formats. It is more general than isoize_date.pl, but requires that you tell it what kind of input to expect and what kind of output to produce.

  • monddccyy_to_iso.pl looks for dates like Feb. 6, 1788 and converts them to ISO format. It illustrates how to map dates with nonnumeric parts to a format that MySQL will understand.

All three scripts are located in the transfer directory of the recipes distribution. They assume datafiles are in tab-delimited, linefeed-terminated format. (Use cvt_file.pl first if you need to work with files that have a different format.)

Our first date-processing utility, isoize_date.pl, looks for dates in U.S. format and rewrites them into ISO format. You'll recognize that it's modeled after the general input-processing loop shown in Section 10.21, with some extra stuff thrown in to perform a specific type of conversion:

#!/usr/bin/perl # isoize_date.pl - Read input data, look for values that match # a date pattern, convert them to ISO format. Also converts # 2-digit years to 4-digit years, using a transition point of 70. # By default, this looks for dates in MM-DD-[CC]YY format. # Assumes tab-delimited, linefeed-terminated input lines. # Does not check whether dates actually are valid (for example, # won't complain about 13-49-1928). use strict; use warnings; # transition point at which 2-digit years are assumed to be 19XX # (below they are treated as 20XX) my $transition = 70; while (<>) {   chomp;   my @val = split (/\t/, $_, 10000);  # split, preserving all fields   for my $i (0 .. @val - 1)   {     my $val = $val[$i];     # look for strings in MM-DD-[CC]YY format     next unless $val =~ /^(\d{1,2})\D(\d{1,2})\D(\d{2,4})$/;     my ($month, $day, $year) = ($1, $2, $3);     # to interpret dates as DD-MM-[CC]YY instead, replace preceding     # line with the following one:     #my ($day, $month, $year) = ($1, $2, $3);     # convert 2-digit years to 4 digits, and then update value in array     $year += ($year >= $transition ? 1900 : 2000) if $year < 100;     $val[$i] = sprintf ("%04d-%02d-%02d", $year, $month, $day);   }   print join ("\t", @val) . "\n"; } 

If you feed isoize_date.pl an input file that looks like this:

Fred    04-13-70 Mort    09-30-69 Brit    12-01-57 Carl    11-02-73 Sean    07-04-63 Alan    02-14-65 Mara    09-17-68 Shepard 09-02-75 Dick    08-20-52 Tony    05-01-60 

It produces the following output:

Fred    1970-04-13 Mort    2069-09-30 Brit    2057-12-01 Carl    1973-11-02 Sean    2063-07-04 Alan    2065-02-14 Mara    2068-09-17 Shepard 1975-09-02 Dick    2052-08-20 Tony    2060-05-01 

isoize_date.pl serves a specific purpose: it converts only from U.S. to ISO format. It does not perform validity checking on date subparts or allow the transition point for adding the century to be specified. A more general tool would be more useful. The next script, cvt_date.pl, extends the capabilities of isoize_date.pl; it recognizes input dates in ISO, U.S., or British formats and converts any of them to any other. It also can convert two-digit years to four digits, enables you to specify the conversion transition point, and can warn about bad dates. As such, it can be used to preprocess input for loading into MySQL, or to postprocess data exported from MySQL for use by other programs.

cvt_date.pl understands the following options:


--iformat = format, --oformat = format, --format = format

Set the date format for input, output, or both. The default format value is iso; cvt_date.pl also recognizes any string beginning with us or br as indicating U.S. or British date format.


--add-century

Convert two-digit years to four digits.


--columns = column_list

Convert dates only in the named columns. By default, cvt_date.pl looks for dates in all columns. If this option is given, column_list should be a list of one or more column positions or ranges separated by commas. (Ranges can be given as m-n to specify columns m tHRough n.) Positions begin at 1.


--transition = n

Specify the transition point for two-digit to four-digit year conversions. The default transition point is 70. This option turns on --add-century.


--warn

Warn about bad dates. (Note that this option can produce spurious warnings if the dates have two-digit years and you don't specify --add-century, because leap year testing won't always be accurate in that case.)

I won't show the code for cvt_date.pl here (most of it is taken up with processing command-line options), but you can examine the source for yourself if you like. As an example of how cvt_date.pl works, suppose that you have a file newdata.txt with the following contents:

name1   01/01/99    38 name2   12/31/00    40 name3   02/28/01    42 name4   01/02/03    44 

Running the file through cvt_date.pl with options indicating that the dates are in U.S. format and that the century should be added produces this result:

% cvt_date.pl --iformat=us --add-century newdata.txt name1   1999-01-01  38 name2   2000-12-31  40 name3   2001-02-28  42 name4   2003-01-02  44 

To produce dates in British format instead with no year conversion, do this:

% cvt_date.pl --iformat=us --oformat=br newdata.txt name1   01-01-99    38 name2   31-12-00    40 name3   28-02-01    42 name4   02-01-03    44 

cvt_date.pl has no knowledge of the meaning of each data column, of course. If you have a nondate column with values that match the pattern, it will rewrite that column, too. To deal with that, specify a --columns option to limit the columns that cvt_date.pl attempts to convert.

isoize_date.pl and cvt_date.pl both operate on dates written in all-numeric formats. But dates in datafiles often are written differently, in which case it may be necessary to write a special purpose script to process them. Suppose an input file contains dates in the following format (these represent the dates on which U.S. states were admitted to the Union):

Delaware        Dec. 7, 1787 Pennsylvania    Dec 12, 1787 New Jersey      Dec. 18, 1787 Georgia         Jan. 2, 1788 Connecticut     Jan. 9, 1788 Massachusetts   Feb. 6, 1788 Maryland        Apr. 28, 1788 South Carolina  May 23, 1788 New Hampshire   Jun. 21, 1788 Virginia        Jun 25, 1788 ... 

The dates consist of a three-character month abbreviation (possibly followed by a period), the numeric day of the month, a comma, and the numeric year. To import this file into MySQL, you need to convert the dates to ISO format, resulting in a file that looks like this:

Delaware        1787-12-07 Pennsylvania    1787-12-12 New Jersey      1787-12-18 Georgia         1788-01-02 Connecticut     1788-01-09 Massachusetts   1788-02-06 Maryland        1788-04-28 South Carolina  1788-05-23 New Hampshire   1788-06-21 Virginia        1788-06-25 ... 

That's a somewhat specialized kind of transformation, although this general type of problem (converting a particular date format to ISO format) is hardly uncommon. To perform the conversion, identify the dates as those values matching an appropriate pattern, map month names to the corresponding numeric values, and reformat the result. The following script, monddccyy_to_iso.pl, illustrates how to do this:

#!/usr/bin/perl # monddccyy_to_iso.pl - convert dates from mon[.] dd, ccyy to ISO format # Assumes tab-delimited, linefeed-terminated input use strict; use warnings; my %map =   # map 3-char month abbreviations to numeric month (   "jan" => 1, "feb" => 2, "mar" => 3, "apr" => 4, "may" => 5, "jun" => 6,   "jul" => 7, "aug" => 8, "sep" => 9, "oct" => 10, "nov" => 11, "dec" => 12 ); while (<>) {   chomp;   my @val = split (/\t/, $_, 10000);    # split, preserving all fields   for my $i (0 .. @val - 1)   {     # reformat the value if it matches the pattern, otherwise assume     # that it's not a date in the required format and leave it alone     if ($val[$i] =~ /^([^.]+)\.? (\d+), (\d+)$/)     {       # use lowercase month name       my ($month, $day, $year) = (lc ($1), $2, $3);       if (exists ($map{$month}))       {         $val[$i] = sprintf ("%04d-%02d-%02d",                   $year, $map{$month}, $day);       }       else       {         # warn, but don't reformat         warn "$val[$i]: bad date?\n";       }     }   }   print join ("\t", @val) . "\n"; } 

The script only does reformatting, it doesn't validate the dates. To do that, modify the script to use the Cookbook_Utils.pm module by adding this statement after the use warnings line:

use Cookbook_Utils; 

That gives the script access to the module's is_valid_date⁠(⁠ ⁠ ⁠) routine. To use it, change this line:

if (exists ($map{$month})) 

To this:

if (exists ($map{$month}) && is_valid_date ($year, $map{$month}, $day)) 




MySQL Cookbook
MySQL Cookbook
ISBN: 059652708X
EAN: 2147483647
Year: 2004
Pages: 375
Authors: Paul DuBois

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net