12.20. Canned Regexes
Regular expressions are wonderfully easy to code wrongly: to miss edge-cases, to include unexpected (and incorrect) matches, or to create a pattern that's correct but hopelessly inefficient. And even when you get your regex right, you still have to maintain the code that you used to build it.

It's a drag. Worse, it's everybody's drag. All around the world there are thousands of Perl programmers continually reinventing the same regexes: to match numbers, and URLs, and quoted strings, and programming language comments, and IP addresses, and Roman numerals, and zip codes, and Social Security numbers, and balanced brackets, and credit card numbers, and email addresses.

Fortunately there's a CPAN module named Regexp::Common, whose entire purpose is to generate these kinds of everyday regular expressions for you. The module installs a single hash (%RE), through which you can create thousands of commonly needed regexes.

For example, instead of building yourself a number-matcher:

    # Build a regex that matches floating point representations...
    Readonly my $DIGITS   => qr{ \d+ (?: [.] \d*)? | [.] \d+         }xms;
    Readonly my $SIGN     => qr{ [+-]                                }xms;
    Readonly my $EXPONENT => qr{ [Ee] $SIGN? \d+                     }xms;
    Readonly my $NUMBER   => qr{ ( ($SIGN?) ($DIGITS) ($EXPONENT?) ) }xms;

    # and later...
    my ($number) = $input =~ $NUMBER;

you can ask Regexp::Common to do it for you:

    use Regexp::Common;

And instead of beating your head against the appalling regex needed to match formal HTTP-style URIs:

    # Build a regex that matches HTTP addresses...
    Readonly my $HTTP => qr{
        (?:(?:http)://(?:(?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*
        (?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z])[.]?)|(?:[0-9]+[.][0-9]+[.]
        [0-9]+[.][0-9]+)))(?::(?:(?:[0-9]*)))?
        (?:/(?:(?:(?:(?:(?:(?:[a-zA-Z0-9 \-_.!~*'( ):@&=+\$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)
        (?:;(?:(?:[a-zA-Z0-9 \-_.!~*'( ):@&=+\$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*)
        (?:/(?:(?:(?:[a-zA-Z0-9 \-_.!~*'( ):@&=+\$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)
        (?:;(?:(?:[a-zA-Z0-9 \-_.!~*'( ):@&=+\$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*))*))
        (?:[?] (?:(?:(?:[;/?:@&=+\$,a-zA-Z0-9\-_.!~*'( )]+|(?:%[a-fA-F0-9][a-fA-F0-9 ]))*)))?))?)
    }xms;

    # Find web pages...
    URI:
    while (my $uri = <>) {
        next URI if $uri !~ m/ $HTTP /xms;
        print $uri;
    }

you can just use:

    use Regexp::Common;

The benefits are perhaps most noticeable when you need a slight variation on a common regex, such as one that matches numbers in base 12, with between six and nine duodecimal places:

    use Regexp::Common;

or a regular expression to help expurgate potentially rude words:

    use Regexp::Common;

or a pattern that checks Australian postcodes:

    use Regexp::Common;
    use IO::Prompt;

The regexes produced by Regexp::Common are reliable, robust, and efficient, because they're in wide and continual use (i.e., endlessly crash-tested), and they're regularly maintained and enhanced by some of the most competent developers in the Perl community. The module also has the most extensive test suite on the entire CPAN, with more than 175,000 tests.
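To make the %RE lookups concrete, here is a minimal sketch, assuming Regexp::Common is installed from CPAN, of how the five patterns discussed in this section might be requested and used. The hash keys and flags (num/real, -keep, -base, -places, URI/HTTP, profanity, zip/Australia) are taken from the Regexp::Common documentation; the subroutine names are hypothetical and exist only for illustration. Plain `my` variables stand in for Readonly constants to avoid a second CPAN dependency, and the interactive IO::Prompt step is left out.

```perl
use strict;
use warnings;
use Regexp::Common;    # CPAN module; exports the %RE hash by default

# Floating-point numbers. -keep makes the pattern capture, with $1
# holding the entire number.
my $NUMBER = $RE{num}{real}{-keep};

# HTTP URIs (the scheme defaults to plain "http").
my $HTTP = $RE{URI}{HTTP};

# Base-12 reals with six to nine duodecimal places.
my $DUODECIMAL = $RE{num}{real}{-base => 12}{-places => '6,9'};

# Potentially rude words.
my $PROFANITY = $RE{profanity};

# Australian postcodes.
my $POSTCODE = $RE{zip}{Australia};

# Hypothetical helper: return the first number found in $input, or undef.
sub extract_number {
    my ($input) = @_;
    my ($number) = $input =~ m/$NUMBER/ms;
    return $number;
}

# Hypothetical helper: true if $text contains an HTTP URI.
sub contains_http_uri {
    my ($text) = @_;
    return $text =~ m/$HTTP/ms ? 1 : 0;
}

# Hypothetical helper: true if the whole string is a base-12 number.
sub is_duodecimal {
    my ($text) = @_;
    return $text =~ m/\A(?:$DUODECIMAL)\z/ms ? 1 : 0;
}

# Hypothetical helper: return a copy of $text with rude words replaced.
sub expurgate {
    my ($text) = @_;
    $text =~ s/$PROFANITY/[DELETED]/gms;
    return $text;
}

# Hypothetical helper: true if the whole string is an Australian postcode.
sub is_au_postcode {
    my ($text) = @_;
    return $text =~ m/\A(?:$POSTCODE)\z/ms ? 1 : 0;
}
```

Storing each pattern once in a named variable keeps the intent visible at the point of use, in the same way the hand-built $NUMBER and $HTTP constants do, but without any regex code left to maintain.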