
12.11. Properties

Prefer properties to enumerated character classes.

Explicit character classes are frequently used to match character ranges, especially alphabetics. For example:



# Alphabetics-only identifier...
Readonly my $ALPHA_IDENT => qr/ [A-Z] [A-Za-z]* /xms;

However, a character class like that doesn't actually match all possible alphabetics. It matches only ASCII alphabetics. It won't recognize the common Latin-1 variants, let alone the full gamut of Unicode alphabetics.

That result might be okay, if you're sure your data will never be other than parochial, but in today's post-modern, multicultural, outsourced world it's rather déclassé for an überhacking rōnin to create identifier regexes that won't even match 'déclassé' or 'überhacking' or 'rōnin'.

Regular expressions in Perl 5.6 and later [*] support the use of the \p{...} escape, which allows you to use full Unicode properties. Properties are Unicode-compliant named character classes and are both more general and more self-documenting than explicit ASCII character classes. The perlunicode manpage explains the mechanism in detail and lists the available properties.

[*] Perl's Unicode support was still highly experimental in the 5.6 releases, and has improved considerably since then. If you're intending to make serious use of Unicode in production code, you really need to be running the latest 5.8.X release you can, and at the very least Perl 5.8.1.

So, if you're ready to concede that ASCII-centrism is a naïve façade that's gradually fading into Götterdämmerung, you might choose to bid it adiós and open your regexes to the full Unicode smörgåsbord, by changing the previous identifier regex to:


Readonly my $ALPHA_IDENT => qr/ \p{Uppercase}  \p{Alphabetic}* /xms;
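The difference between the two versions of the identifier regex can be seen in a small sketch like the following. It makes a few assumptions that are not in the original: the patterns are anchored with \A and \z so that whole words are tested, plain `my` is used instead of Readonly to avoid a non-core dependency, and the sample words are invented for illustration.

```perl
use strict;
use warnings;

# ASCII-only class versus Unicode properties (both anchored for whole-word tests)
my $ASCII_IDENT   = qr/ \A [A-Z]         [A-Za-z]*       \z /xms;
my $UNICODE_IDENT = qr/ \A \p{Uppercase} \p{Alphabetic}* \z /xms;

# Hypothetical sample words:
#   \x{DC} is U+00DC (capital U with diaeresis)
#   \x{C9} is U+00C9 (capital E with acute)
my @words = ( 'Hacking', "\x{DC}berhacking", "\x{C9}lan" );

binmode STDOUT, ':encoding(UTF-8)';

my %accepted;
for my $word (@words) {
    my $ascii_ok   = $word =~ $ASCII_IDENT   ? 1 : 0;
    my $unicode_ok = $word =~ $UNICODE_IDENT ? 1 : 0;
    $accepted{$word} = { ascii => $ascii_ok, unicode => $unicode_ok };
    printf "%-12s ascii=%d unicode=%d\n", $word, $ascii_ok, $unicode_ok;
}
```

Only the first word satisfies the ASCII class, whereas the property-based pattern accepts all three.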


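There are even properties to help create identifiers that follow the normal Perl conventions but are still language-independent. One caveat: \p{ID_Start} does not include the underscore, so a leading underscore has to be allowed explicitly if you need it. Instead of: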

Readonly my $PERL_IDENT => qr/ [A-Za-z_] \w* /xms;

you can use:


Readonly my $PERL_IDENT => qr/ \p{ID_Start} \p{ID_Continue}* /xms;
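As a rough illustration of how the property-based identifier pattern behaves, the sketch below tests a handful of invented candidate strings. The anchors, the is_ident helper, and the second pattern (which additionally permits a leading underscore, because the underscore belongs to ID_Continue but not to ID_Start) are assumptions added for this example, not part of the original recommendation.

```perl
use strict;
use warnings;

# The pattern from the text, anchored so that the whole string must match
my $PERL_IDENT = qr/ \A \p{ID_Start} \p{ID_Continue}* \z /xms;

# Assumed variant: also accept a leading underscore
my $PERL_IDENT_US = qr/ \A (?: \p{ID_Start} | _ ) \p{ID_Continue}* \z /xms;

# Hypothetical helper: returns 1 if the candidate matches the pattern, else 0
sub is_ident {
    my ($candidate, $pattern) = @_;
    return $candidate =~ $pattern ? 1 : 0;
}

# Invented candidates; \x{DC} is U+00DC (capital U with diaeresis)
my @candidates = ( 'count', 'max_len2', "\x{DC}ber_count", '2fast', '_private' );

binmode STDOUT, ':encoding(UTF-8)';
for my $candidate (@candidates) {
    printf "%-12s plain=%d with_underscore=%d\n",
        $candidate,
        is_ident($candidate, $PERL_IDENT),
        is_ident($candidate, $PERL_IDENT_US);
}
```

The two patterns disagree only on '_private'; a candidate that starts with a digit is rejected by both.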


One other particularly useful property is \p{Any}, which provides a more readable alternative to the normal dot (.) metacharacter. For example, instead of:


m/ [{] . [.] \d{2} [}] /xms;


you could write:


m/ [{] \p{Any} [.] \d{2} [}] /xms;


and leave the reader in no doubt that the second character to be matched really can be anything at all: an ASCII alphabetic, a Latin-1 superscript, an Extended Latin diacritical, a Devanagari number, an Ogham rune, or even a Bopomofo symbol.
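One practical consequence is that \p{Any} matches any character whether or not the /s flag is present, whereas the dot only matches a newline under /s. The sketch below deliberately omits /s (unlike the /xms patterns in the text) to make that difference visible. The anchors and the sample inputs are assumptions made for the example.

```perl
use strict;
use warnings;

# Both patterns omit /s on purpose, to show where the dot falls short
my $WITH_DOT = qr/ \A [{] .       [.] \d{2} [}] \z /x;
my $WITH_ANY = qr/ \A [{] \p{Any} [.] \d{2} [}] \z /x;

# Invented inputs, labelled so that nothing non-ASCII has to be printed
my @samples = (
    [ 'ASCII letter',    "{x.42}"         ],
    [ 'Bopomofo letter', "{\x{3105}.42}"  ],   # U+3105 BOPOMOFO LETTER B
    [ 'newline',         "{\n.42}"        ],
);

for my $sample (@samples) {
    my ($label, $text) = @{$sample};
    printf "%-16s dot=%d any=%d\n",
        $label,
        ( $text =~ $WITH_DOT ? 1 : 0 ),
        ( $text =~ $WITH_ANY ? 1 : 0 );
}
```

With /xms applied consistently, as the text recommends, both forms accept the same inputs; the property version simply states the intent more explicitly.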


12.12. Whitespace

Consider matching arbitrary whitespace, rather than specific whitespace characters.

Unless you're matching regular expressions against fixed-format machine-generated data, avoid matching specific whitespace characters exactly. If humans were directly involved anywhere in the data acquisition, then the notion of "fixed" will probably have been more honoured in the breach than in the observance.

If, for example, the input is supposed to consist of a label, followed by a single space, followed by an equals sign, followed by a single space, followed by a value...don't bet on it. Most users nowadays will, quite reasonably, assume that whitespace is negotiable; nothing more than an elastic formatting medium. So, in a configuration file, you're just as likely to get something like:

name       = Yossarian, J
rank       = Captain
serial_num = 3192304

The whitespace in that data might be single tabs, multiple tabs, multiple spaces, single spaces, or any combination thereof. So matching that data with a pattern that insists on exactly one space character at the relevant points is unlikely to be uniformly successful:

$config_line =~ m{ ($IDENT)  [\N{SPACE}]  =  [\N{SPACE}]  (.*) }xms

Worse still, it's also unlikely to be uniformly unsuccessful. For instance, in the example data, it might only match the serial number. And that kind of intermittent success will make your program much harder to debug. It might also make it difficult to realize that any debugging is required.

Unless you're specifically vetting data to verify that it conforms to a required fixed format, it's much better to be very liberal in what you accept when it comes to whitespace. Use \s+ for any required whitespace and \s* for any optional whitespace. For example, it would be far more robust to match the example data against:


$config_line =~ m{ ($IDENT)  \s*  =  \s*  (.*) }xms
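To make the contrast concrete, here is a minimal sketch that runs both the strict and the liberal pattern over the example data. The definition of $IDENT is an assumption (the text does not show it), as is the choice to separate the second line's fields with tabs; the %strict and %liberal hashes are hypothetical names used only to collect the results.

```perl
use strict;
use warnings;
use charnames ':full';    # needed for \N{SPACE} on older Perls

# Assumed definition; the original text does not show $IDENT
my $IDENT = qr/ [A-Za-z_] \w* /xms;

# The example data, with tabs assumed on the second line
my @config_lines = (
    "name       = Yossarian, J",
    "\trank\t=\tCaptain",
    "serial_num = 3192304",
);

my (%strict, %liberal);
for my $config_line (@config_lines) {
    # Insists on exactly one space on each side of the equals sign
    if ( $config_line =~ m{ ($IDENT)  [\N{SPACE}]  =  [\N{SPACE}]  (.*) }xms ) {
        $strict{$1} = $2;
    }
    # Accepts any amount (or kind) of whitespace around the equals sign
    if ( $config_line =~ m{ ($IDENT)  \s*  =  \s*  (.*) }xms ) {
        $liberal{$1} = $2;
    }
}

for my $key (sort keys %liberal) {
    printf "%-10s liberal='%s' strict=%s\n",
        $key, $liberal{$key},
        ( exists $strict{$key} ? "'$strict{$key}'" : 'no match' );
}
```

Under these assumptions the strict pattern picks up only serial_num, while the liberal pattern recovers all three settings.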