Item 21: Make regular expressions readable.

Regular expressions are often messy and confusing. There's no denying it: it's true.

One reason that regular expressions are confusing is that they have a very compact and visually distracting appearance. They are a "little language" unto themselves. However, this little language isn't made up of words like foreach and while. Instead, it uses atoms and operators like \w, [a-z], and +.

Another reason is that what regular expressions do can be confusing in and of itself. Ordinary programming chores generally translate more or less directly into code. You might think "count from 1 to 10" and write for $i (1..10) { print "$i\n" }. But a regular expression that accomplishes a particular task may not look a whole lot like a series of straightforward instructions. You might think "find me a single-quoted string" and wind up with something like /'(?:\\'|.)*?'/.

It's a good idea to try to make regular expressions more readable, especially if you intend to share your programs with others, or if you plan to work on them some more yourself at a later date. Of course, trying to keep regular expressions simple is a start, but there are a couple of Perl features you can use that will help you make even complex regular expressions more understandable.

Use `/x` to add whitespace to regular expressions

Normally, whitespace encountered in a regular expression is significant:

($a, $b, $c) = /^(\w+) (\w+) (\w+)/;	Find three words separated by one space, at start of `$_`.

$_ = "Testing
one
two";	`$_` contains embedded newlines (same as if we had used `\n`).
s/
/<lf>/g;	Replace newlines with `<lf>`.
print "$_\n";	Testing<lf>one<lf>two

The /x flag, which can be applied to both pattern matches and substitutions, causes the regular expression parser to ignore whitespace (so long as it isn't preceded by a backslash and isn't inside a character class) and comments, which run from a # to the end of the line:

 ($str) = /( ' (?: \\' | . )*? ' )/x;

Find a single-quoted string, including escaped quotes.
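As a quick check that /x changes only the layout, the sketch below runs the compact pattern and the spaced-out one against the same input, then shows the two cases where a space still counts under /x. The sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample text containing a single-quoted string with an escaped quote.
my $text = q{print 'it\'s here', "\n";};

# Compact form: every character in the pattern counts.
my ($compact) = $text =~ /('(?:\\'|.)*?')/;

# Same pattern under /x: the spaces are ignored by the regex parser.
my ($spaced) = $text =~ /( ' (?: \\' | . )*? ' )/x;

print "compact: $compact\n";
print "spaced:  $spaced\n";

# Under /x, a space still counts when backslashed or inside a character class.
my $plain   = "a b" =~ /a b/x   ? 1 : 0;   # pattern is effectively /ab/
my $escaped = "a b" =~ /a\ b/x  ? 1 : 0;   # backslashed space is literal
my $class   = "a b" =~ /a[ ]b/x ? 1 : 0;   # space in a character class is literal

print "plain=$plain escaped=$escaped class=$class\n";
```

Both matches capture 'it\'s here', and the last line prints plain=0 escaped=1 class=1.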

This can be especially helpful when a regular expression includes a complex alternation:

 ($str) = / (
    " (?:
      \\\W |  # special char
      \\x[0-9a-fA-F][0-9a-fA-F] |  # hex
      \\[0-3]?[0-7]?[0-7] |  # octal
      [^"\\] # ordinary char
    )* "
  ) /x;

Find a double-quoted string, including hex and octal escapes.
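To see the commented alternation at work, here is a minimal sketch that applies it to a sample line. The input text is hypothetical, chosen so that each of the four alternatives gets used at least once.

```perl
use strict;
use warnings;

# Hypothetical input: a double-quoted string with an escaped quote,
# a hex escape, and an octal escape.
my $text = q{say "hi \"there\" \x41 \101" now};

my ($str) = $text =~ / (
    " (?:
      \\\W                       |  # special char
      \\x[0-9a-fA-F][0-9a-fA-F]  |  # hex
      \\[0-3]?[0-7]?[0-7]        |  # octal
      [^"\\]                        # ordinary char
    )* "
  ) /x;

print "$str\n";
```

The match runs from the first double quote to the last one, because the escaped quotes inside are consumed by the "special char" branch rather than ending the string.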

Break complex regular expressions into pieces

As you saw in Item 15, regular expressions are subject to double-quote interpolation. You can use this feature to write regular expressions that are built up with variables. In some cases, this may make them easier to read:

 $num = '[0-9]+';	Create some "subpatterns."
 $word = '[a-zA-Z_]+';
 $space = '[ ]+';

 $_ = "Testing 1 2 3";	Some sample data.
 @split = /($num | $word | $space)/gxo;	Match into an array.
 print join(":", @split), "\n";	Testing: :1: :2: :3

The pattern this example creates is /([0-9]+ | [a-zA-Z_]+ | [ ]+)/gxo. We used the /o ("compile once") flag, because there is no need for Perl to compile this regular expression more than once.

Notice that there weren't any backslashes in the example. It's hard to avoid using backslashes in more complex patterns. Because of the way Perl handles backslashes and character escapes in strings (and regular expressions), backslashes in subpattern strings should be doubled to work properly. Doubling is required in double-quoted strings and is the safe habit in single-quoted ones:

 $num = '\\d+';	'\\d+' becomes the string '\d+', etc.
 $word = '\\w+';
 $space = '\\ +';

 $_ = "Testing 1 2 3";	Some sample data.
 @split = /($num | $word | $space)/gxo;	Match into an array.
 print join(":", @split), "\n";	Testing: :1: :2: :3

The pattern this example creates is /(\d+ | \w+ | \ +)/gxo.
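The effect of doubling can be checked directly. The sketch below compares three ways of writing the same subpattern; the variable names and sample text are made up for illustration. In a single-quoted string Perl treats only \\ and \' specially, so the doubled and undoubled forms come out the same there, while a double-quoted string needs the doubled form.

```perl
use strict;
use warnings;

my $single_doubled = '\\d+';   # single quotes, doubled backslash
my $single_plain   = '\d+';    # single quotes, one backslash
my $double_doubled = "\\d+";   # double quotes, doubled backslash

# All three hold the same three characters: backslash, d, plus.
print length($single_doubled), " ",
      length($single_plain), " ",
      length($double_doubled), "\n";

# The string works as a subpattern once interpolated into a match.
my @nums = "a1 b22 c333" =~ /($single_doubled)/g;
print join(":", @nums), "\n";
```

This prints 3 3 3 followed by 1:22:333. Doubling everywhere costs a few keystrokes but means a subpattern keeps working if its quotes are later changed from single to double.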

If we want a literal backslash in a regular expression, it has to be backslashed (e.g., /\\/ matches a single backslash). Because backslashes in variables have to be doubled, this can result in some ugly-looking strings: '\\\\' to match a backslash and '\\\\\\w' to match a backslash followed by a \w character. This is not going to make our regular expressions more readable in any obvious way, so when dealing with subpatterns containing backslashes, it's wise to make up some strings in variables to hide this ugliness. Let's rewrite the double-quoted string example from above, this time using some variables: ^[7]

^[7] If something like "${back}[0-3][0-7]{2}" worries you, feel free to write it as $back . "[0-3][0-7]{2}".

 $back = '\\\\';	Pattern for backslash.

 $spec_ch = "$back\\W";	Escaped char like \", \$.
 $hex_ch = "${back}x[0-9a-fA-F]{2}";	Hex escape: \xab.
 $oct_ch = "${back}[0-3]?[0-7]?[0-7]";	Oct escape: \123.
 $char = "[^\"$back]";	Ordinary char.

 ($str) = /(	Here's the actual pattern match.
   " (
    $spec_ch | $hex_ch | $oct_ch | $char
   )* "
  )/xo;

If you are curious as to exactly what a regular expression built up in this manner looks like, print it out. Here's one way:

Continued from above:

 print <<EOT;	Just wrap everything in a double-quoted here-doc string.
 /(
   " (
    $spec_ch | $hex_ch | $oct_ch | $char
   )* "
  )/xo;
EOT

This will print:

 /(
   " (
    \\\W | \\x[0-9a-fA-F]{2} | \\[0-3]?[0-7]?[0-7] | [^"\\]
   )* "
  )/xo;
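Perl 5.005 and later also provide the qr// operator, which the examples above do not use. As an alternative sketch, not the approach taken in this Item, the same subpatterns can be written as compiled regex pieces, so each backslash is written exactly as it would appear in an ordinary pattern. The name $dq_string and the sample input are made up for illustration.

```perl
use strict;
use warnings;

# Subpatterns as regex objects: no extra doubling for string quoting.
my $spec_ch = qr/\\\W/;                  # escaped char like \" or \$
my $hex_ch  = qr/\\x[0-9a-fA-F]{2}/;     # hex escape: \xab
my $oct_ch  = qr/\\[0-3]?[0-7]?[0-7]/;   # octal escape: \123
my $char    = qr/[^"\\]/;                # ordinary char

# Assemble the pieces; /x lets the alternation be spaced out.
my $dq_string = qr/ ( " (?: $spec_ch | $hex_ch | $oct_ch | $char )* " ) /x;

my $text = q{say "hi \"there\" \x41 \101" now};   # hypothetical input
my ($str) = $text =~ $dq_string;
print "$str\n";
```

Printing $dq_string itself shows the assembled pattern, much as the here-doc trick does for the string-based version.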

This is a fairly straightforward example of using variables to construct regular expressions. See the Hip Owls book for a much more complex example: a regular expression that can parse an RFC822 address.
