12.11. Properties
Explicit character classes are frequently used to match character ranges,
# Alphabetics-only identifier...
Readonly my $ALPHA_IDENT => qr/ [A-Z] [A-Za-z]* /xms;
However, a character class like that doesn't actually match all possible alphabetics. It matches only ASCII alphabetics. It won't recognize the common Latin-1 variants, let alone the full
That result might be okay, if you're sure your data will never be other than parochial, but in today's post-modern, multicultural, outsourced world it's rather déclassé for an überhacking r
Regular expressions in Perl 5.6 and later[*] support the use of the \p{...} escape, which allows you to use full Unicode properties. Properties are Unicode-compliant named character classes and are both more general and more
So, if you're ready to concede that ASCII-centrism is a naïve façade that's gradually fading into Götterdämmerung, you might choose to bid it adiós and
Readonly my $ALPHA_IDENT => qr/ \p{Uppercase} \p{Alphabetic}* /xms;
There are even properties to help create identifiers that follow the normal Perl conventions but are still language-independent. Instead of:
Readonly my $PERL_IDENT => qr/ [A-Za-z_] \w*/xms;
you can use:
Readonly my $PERL_IDENT => qr/ \p{ID_Start} \p{ID_Continue}* /xms;
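The difference between the explicit character class and the property-based pattern is easiest to see on a few sample words. The following is a minimal sketch, not part of the original text: the helper matches_fully() and the sample words are hypothetical, plain my variables stand in for Readonly so the script needs no non-core modules, and the non-ASCII characters are written as \x{...} escapes so the result does not depend on the source file's encoding.

```perl
use strict;
use warnings;

# Plain lexicals stand in for Readonly here (an assumption made so that
# the sketch runs without any CPAN modules).
my $ASCII_IDENT = qr/ [A-Z] [A-Za-z]* /xms;
my $ALPHA_IDENT = qr/ \p{Uppercase} \p{Alphabetic}* /xms;

# Hypothetical helper: true only if the entire string matches the pattern.
sub matches_fully {
    my ($pattern, $text) = @_;
    return $text =~ m/ \A $pattern \z /xms ? 1 : 0;
}

# Sample words: plain ASCII, u-umlaut (U+00FC), capital E-acute (U+00C9).
my @samples = (
    [ 'Zurich'          => 'Zurich'        ],
    [ 'Z<U+00FC>rich'   => "Z\x{FC}rich"   ],
    [ '<U+00C9>mile'    => "\x{C9}mile"    ],
);

for my $sample (@samples) {
    my ($label, $word) = @{$sample};
    printf "%-16s ascii=%d property=%d\n",
        $label,
        matches_fully($ASCII_IDENT, $word),
        matches_fully($ALPHA_IDENT, $word);
}
```

The ASCII class accepts only the first word, while the property-based pattern accepts all three. One caveat worth checking against your own Perl version: as Unicode defines it, ID_Start does not include the underscore, so a pattern starting with \p{ID_Start} may not be an exact replacement for one starting with [A-Za-z_].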
One other particularly useful property is \p{Any}, which provides a more readable alternative to the normal dot (.) metacharacter. For example, instead of:
m/ [{] . [.] \d{2} [}] /xms;
you could write:
m/ [{] \p{Any} [.] \d{2} [}] /xms;
and leave the reader in no doubt that the second character to be matched really can be anything at all: an ASCII alphabetic, a Latin-1 superscript, an Extended Latin
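As a quick check that the two spellings behave alike under the /xms flags, here is a small sketch. The helper compare_forms() and the sample strings are hypothetical, chosen only for illustration; the patterns themselves are the two shown above.

```perl
use strict;
use warnings;

my $DOT_FORM = qr/ [{] .       [.] \d{2} [}] /xms;
my $ANY_FORM = qr/ [{] \p{Any} [.] \d{2} [}] /xms;

# Hypothetical helper: report whether each pattern matches the text.
sub compare_forms {
    my ($text) = @_;
    my $dot = $text =~ $DOT_FORM ? 1 : 0;
    my $any = $text =~ $ANY_FORM ? 1 : 0;
    return ($dot, $any);
}

# Labelled samples, so that no raw control or non-ASCII character is printed.
my @samples = (
    [ 'ASCII letter'         => '{a.42}'        ],
    [ 'Latin-1 superscript'  => "{\x{B2}.07}"   ],
    [ 'newline'              => "{\n.99}"       ],
    [ 'two characters'       => '{ab.42}'       ],
);

for my $sample (@samples) {
    my ($label, $text) = @{$sample};
    my ($dot, $any) = compare_forms($text);
    printf "%-20s dot=%d any=%d\n", $label, $dot, $any;
}
```

Because the /s flag lets the dot match a newline, both forms accept the first three samples and reject the last; the gain from \p{Any} is readability, not a change in behaviour.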
12.12. Whitespace
Unless you're matching regular expressions against
If, for example, the input is supposed to consist of a label, followed by a single space, followed by an equals sign, followed by a single space, followed by a value...don't bet on it. Most users nowadays will, quite reasonably, assume that whitespace is negotiable; nothing more than an elastic formatting medium. So, in a configuration file, you're just as likely to get something like:
name = Yossarian, J
rank = Captain
serial_num = 3192304
The whitespaces in that data might be single tabs, multiple tabs, multiple spaces, single spaces, or any combination thereof. So matching that data with a pattern that insists on exactly one space character at the relevant points is
$config_line =~ m{ ($IDENT) [\N{SPACE}] = [\N{SPACE}] (.*) }xms
Worse still, it's also unlikely to be uniformly
Unless you're
$config_line =~ m{ ($IDENT) \s* = \s* (.*) }xms
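To make the contrast concrete, the sketch below runs both patterns over three configuration lines that differ only in their whitespace. It rests on several assumptions: $IDENT is not defined in this section, so a simple \w+ pattern stands in for it, and the helpers parse_strict() and parse_flexible() and the exact spacing of the sample lines are hypothetical.

```perl
use strict;
use warnings;
use charnames ':full';    # lets \N{SPACE} work on Perls older than 5.16

# Assumption: a stand-in for the $IDENT pattern, which is defined elsewhere.
my $IDENT = qr/ \w+ /xms;

# Hypothetical helper: insists on exactly one space on each side of the '='.
sub parse_strict {
    my ($config_line) = @_;
    if ($config_line =~ m{ ($IDENT) [\N{SPACE}] = [\N{SPACE}] (.*) }xms) {
        return ($1, $2);
    }
    return;
}

# Hypothetical helper: tolerates any amount of whitespace around the '='.
sub parse_flexible {
    my ($config_line) = @_;
    if ($config_line =~ m{ ($IDENT) \s* = \s* (.*) }xms) {
        return ($1, $2);
    }
    return;
}

# Same kind of data, three different spacing styles (illustrative only).
my @config_lines = (
    'name = Yossarian, J',
    "rank\t=\tCaptain",
    'serial_num   =   3192304',
);

for my $config_line (@config_lines) {
    my @strict   = parse_strict($config_line);
    my @flexible = parse_flexible($config_line);
    printf "strict=%-3s flexible=%-3s label=%s\n",
        ( @strict   ? 'yes' : 'no' ),
        ( @flexible ? 'yes' : 'no' ),
        ( @flexible ? $flexible[0] : '(none)' );
}
```

Only the first line satisfies the single-space pattern, whereas the \s* version extracts a label and value from all three.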