Recipe 10.22. Using Pattern Matching to Validate Data


Problem

You need to compare a value to a set of values that is difficult to specify literally without writing a really ugly expression.

Solution

Use pattern matching.

Discussion

Pattern matching is a powerful tool for validation because it enables you to test entire classes of values with a single expression. You can also use pattern tests to break up matched values into subparts for further individual testing or in substitution operations to rewrite matched values. For example, you might break a matched date into pieces so that you can verify that the month is in the range from 1 to 12, and the day is within the number of days in the month. You might use a substitution to reorder MM-DD-YY or DD-MM-YY values into YY-MM-DD format.
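
As a rough sketch of the date example just described, the following Perl code splits an MM-DD-YY value into parts, range-checks the month and day, and then rewrites the value into YY-MM-DD order. The function name is_valid_mmddyy and the decision to always allow 29 days in February (a two-digit year does not identify the century) are assumptions made for illustration, not part of the recipe.

```perl
use strict;
use warnings;

# Hypothetical helper: check an MM-DD-YY value part by part.
sub is_valid_mmddyy {
    my ($val) = @_;
    # Capture month, day, and year as separate subparts.
    return 0 unless $val =~ /^(\d{2})-(\d{2})-(\d{2})$/;
    my ($month, $day) = ($1, $2);
    return 0 if $month < 1 || $month > 12;
    # Days per month; February gets 29 because this sketch does not
    # try to resolve the century of a two-digit year.
    my @days_in_month = (31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);
    return 0 if $day < 1 || $day > $days_in_month[$month - 1];
    return 1;
}

my $date = "12-25-04";
if (is_valid_mmddyy($date)) {
    # Reorder MM-DD-YY into YY-MM-DD with a substitution on a copy.
    (my $reordered = $date) =~ s/^(\d{2})-(\d{2})-(\d{2})$/$3-$1-$2/;
    print "$date becomes $reordered\n";    # prints "12-25-04 becomes 04-12-25"
}
```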

The next few sections describe how to use patterns to test for several types of values, but first let's take a quick tour of some general pattern-matching principles. The following discussion focuses on Perl's regular expression capabilities. Pattern matching in Ruby, PHP, and Python is similar, although you should consult the relevant documentation for any differences. For Java, use the java.util.regex package.

In Perl, the pattern constructor is /pat/:

$it_matched = ($val =~ /pat/);    # pattern match 

Put an i after the /pat/ constructor to make the pattern match case-insensitive:

$it_matched = ($val =~ /pat/i);   # case-insensitive match 

To use a character other than slash, begin the constructor with m. This can be useful if the pattern itself contains slashes:

$it_matched = ($val =~ m|pat|);   # alternate constructor character 

To look for a nonmatch, replace the =~ operator with the !~ operator:

$no_match = ($val !~ /pat/);      # negated pattern match 

To perform a substitution in $val based on a pattern match, use s/pat/replacement/. If pat occurs within $val, it's replaced by replacement. To perform a case-insensitive match, put an i after the last slash. To perform a global substitution that replaces all instances of pat rather than just the first one, add a g after the last slash:

$val =~ s/pat/replacement/;    # substitution
$val =~ s/pat/replacement/i;   # case-insensitive substitution
$val =~ s/pat/replacement/g;   # global substitution
$val =~ s/pat/replacement/ig;  # case-insensitive and global
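
To see how the modifiers change the result, here is a small sketch that applies each form to a copy of the same string. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $original = "Cat cat CAT";

# Each line copies $original, then applies the substitution to the copy.
(my $first  = $original) =~ s/cat/dog/;     # first case-sensitive match only
(my $nocase = $original) =~ s/cat/dog/i;    # first match, ignoring case
(my $global = $original) =~ s/cat/dog/g;    # every case-sensitive match
(my $both   = $original) =~ s/cat/dog/ig;   # every match, ignoring case

print "$first\n";     # Cat dog CAT
print "$nocase\n";    # dog cat CAT
print "$global\n";    # Cat dog CAT
print "$both\n";      # dog dog dog
```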

Here's a list of some of the special pattern elements available in Perl regular expressions:

Pattern         What the pattern matches
^               Beginning of string
$               End of string (or just before a newline at the end of the string)
.               Any character (except newline, unless the s modifier is used)
\s, \S          Whitespace or nonwhitespace character
\d, \D          Digit or nondigit character
\w, \W          Word (alphanumeric or underscore) or nonword character
[...]           Any character listed between the square brackets
[^...]          Any character not listed between the square brackets
p1|p2|p3        Alternation; matches any of the patterns p1, p2, or p3
*               Zero or more instances of preceding element
+               One or more instances of preceding element
{n}             n instances of preceding element
{m,n}           m through n instances of preceding element
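
The following sketch combines several of these elements into a few anchored tests and reports which ones a value satisfies. The labels, the patterns chosen, and the matching_labels helper are hypothetical; the patterns are stored with Perl's qr// operator, which compiles a pattern for later use.

```perl
use strict;
use warnings;

# Illustrative patterns built from the elements in the table.
my %pattern_for = (
    'all digits'               => qr/^\d+$/,
    'no whitespace'            => qr/^\S*$/,
    'vowel first'              => qr/^[aeiou]/i,
    'yes or no'                => qr/^(yes|no)$/,
    'five-digit code'          => qr/^\d{5}$/,
    '2 to 4 uppercase letters' => qr/^[A-Z]{2,4}$/,
);

# Return the labels (in sorted order) of the patterns that $val matches.
sub matching_labels {
    my ($val) = @_;
    my @labels = grep { $val =~ $pattern_for{$_} } sort keys %pattern_for;
    return @labels;
}

for my $val ('12345', 'yes', 'USA') {
    print "$val: ", join(', ', matching_labels($val)), "\n";
}
# 12345: all digits, five-digit code, no whitespace
# yes: no whitespace, yes or no
# USA: 2 to 4 uppercase letters, no whitespace, vowel first
```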


Many of these pattern elements are the same as those available for MySQL's REGEXP regular expression operator (Section 5.11).

To match a literal instance of a character that is special within patterns, such as *, ^, or $, precede it with a backslash. Similarly, to include a character within a character class construction that is special in character classes ([, ], or -), precede it with a backslash. To include a literal ^ in a character class, list it somewhere other than as the first character between the brackets.
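
A brief sketch of these escaping rules follows. The price-like test strings and the variable names are invented for the example.

```perl
use strict;
use warnings;

# Backslashes make $ and . literal, so this matches values such as $19.99.
my $escaped_ok  = ('$19.99' =~ /^\$\d+\.\d{2}$/) ? 1 : 0;   # 1

# An unescaped . matches any character, so "x" is accepted here...
my $bare_dot    = ('$19x99' =~ /^\$\d+.\d{2}$/)  ? 1 : 0;   # 1
# ...but not when the period is escaped.
my $escaped_dot = ('$19x99' =~ /^\$\d+\.\d{2}$/) ? 1 : 0;   # 0

# In a character class, ] and - are escaped, and ^ is literal
# because it is not the first character between the brackets.
my $class_ok    = ('^-]' =~ /^[\]\-^]+$/) ? 1 : 0;          # 1

print "$escaped_ok $bare_dot $escaped_dot $class_ok\n";     # prints "1 1 0 1"
```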

Many of the validation patterns shown in the following sections are of the form /^pat$/. Beginning and ending a pattern with ^ and $ has the effect of requiring pat to match the entire string that you're testing. This is common in data validation contexts, because it's generally desirable to know that a pattern matches an entire input value, not just part of it. (If you want to be sure that a value represents an integer, for example, it doesn't do you any good to know only that it contains an integer somewhere.) This is not a hard-and-fast rule, however, and sometimes it's useful to perform a more relaxed test by omitting the ^ and $ characters as appropriate. For example, if you want to strip leading and trailing whitespace from a value, use one pattern anchored only to the beginning of the string, and another anchored only to the end:
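
To make the difference between anchored and unanchored tests concrete, here is a small sketch using the integer case mentioned above. The function names contains_digits and is_all_digits are hypothetical.

```perl
use strict;
use warnings;

# Unanchored: true if the value contains a run of digits anywhere.
sub contains_digits {
    my ($val) = @_;
    return $val =~ /\d+/ ? 1 : 0;
}

# Anchored: true only if the entire value is a run of digits.
sub is_all_digits {
    my ($val) = @_;
    return $val =~ /^\d+$/ ? 1 : 0;
}

for my $val ('42', 'abc42def', '4 2', '') {
    printf "%-12s contains digits: %d   all digits: %d\n",
        "'$val'", contains_digits($val), is_all_digits($val);
}
```

Only '42' passes both tests; 'abc42def' and '4 2' pass the unanchored test alone. Because $ also matches just before a newline at the end of the string, a value such as "42\n" passes the anchored test too, so it may be worth removing trailing newlines with chomp before validating, or using \z in place of $.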

$val =~ s/^\s+//;   # trim leading whitespace
$val =~ s/\s+$//;   # trim trailing whitespace

That's such a common operation, in fact, that it's a good candidate for being written as a utility function. The Cookbook_Utils.pm file contains a function trim_whitespace() that performs both substitutions and returns the result:

$val = trim_whitespace ($val); 
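
The source of trim_whitespace() is not shown here, so the following is only a guess at what such a function might look like, built from the two substitutions above. The handling of an undefined argument is an assumption, and the version in Cookbook_Utils.pm may differ.

```perl
use strict;
use warnings;

# Hypothetical stand-in for trim_whitespace() from Cookbook_Utils.pm;
# the real implementation may differ.
sub trim_whitespace {
    my ($s) = @_;
    return $s unless defined $s;   # assumption: pass undef through unchanged
    $s =~ s/^\s+//;                # trim leading whitespace
    $s =~ s/\s+$//;                # trim trailing whitespace
    return $s;
}

my $val = "   some value \t\n";
$val = trim_whitespace ($val);
print "[$val]\n";                  # prints "[some value]"
```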

To remember subsections of a string that is matched by a pattern, use parentheses around the relevant parts of the pattern. After a successful match, you can refer to the matched substrings using the variables $1, $2, and so forth:

if ("abcdef" =~ /^(ab)(.*)$/) {
  $first_part = $1; # this will be ab
  $the_rest = $2;   # this will be cdef
}

To indicate that an element within a pattern is optional, follow it with a ? character. To match values consisting of a sequence of digits, optionally beginning with a minus sign, and optionally ending with a period, use this pattern:

/^-?\d+\.?$/ 
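
As a quick check of what that pattern accepts, the sketch below runs it against a handful of sample values. The function name is_simple_number and the sample values are made up for illustration.

```perl
use strict;
use warnings;

# Digits, with an optional leading minus sign and an optional trailing period.
sub is_simple_number {
    my ($val) = @_;
    return $val =~ /^-?\d+\.?$/ ? 1 : 0;
}

for my $val ('123', '-123', '123.', '-123.', '12.5', '--1', '.', '') {
    printf "%-8s %s\n", "'$val'", is_simple_number($val) ? 'match' : 'no match';
}
```

The first four values match. The rest do not: '12.5' has digits after the period, '--1' has two minus signs, and '.' and the empty string contain no digits.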

You can also use parentheses to group alternations within a pattern. The following pattern matches time values in hh:mm format, optionally followed by AM or PM:

/^\d{1,2}:\d{2}\s*(AM|PM)?$/i 

The use of parentheses in that pattern also has the side effect of remembering the optional part in $1. To suppress that side effect, use (?:pat) instead:

/^\d{1,2}:\d{2}\s*(?:AM|PM)?$/i 
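
One way to observe the difference is to evaluate the match in list context, where Perl returns the list of remembered substrings. In this sketch the time patterns are varied slightly by adding capturing parentheses around the hour and minute parts; that change and the variable names are illustrative assumptions.

```perl
use strict;
use warnings;

my $time = "9:30 pm";

# (AM|PM) is a capturing group, so the suffix is remembered as a third value.
my @with_capture = ($time =~ /^(\d{1,2}):(\d{2})\s*(AM|PM)?$/i);

# (?:AM|PM) groups the alternation without remembering it.
my @without_capture = ($time =~ /^(\d{1,2}):(\d{2})\s*(?:AM|PM)?$/i);

print scalar(@with_capture), ": @with_capture\n";        # prints "3: 9 30 pm"
print scalar(@without_capture), ": @without_capture\n";  # prints "2: 9 30"
```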

That's sufficient background in Perl pattern matching to allow construction of useful validation tests for several types of data values. The following sections provide patterns that can be used to test for broad content types, numbers, temporal values, and email addresses or URLs.

The transfer directory of the recipes distribution contains a test_pat.pl script that reads input values, matches them against several patterns, and reports which patterns each value matches. The script is easily extensible, so you can use it as a test harness to try your own patterns.




MySQL Cookbook
ISBN: 059652708X
Year: 2004
Pages: 375
Authors: Paul DuBois
