Regular expressions are often messy and confusing. There's no denying it. One reason is that they have a very compact and visually distracting appearance. They are a "little language" unto themselves. However, this little language isn't made up of words like foreach and while. Instead, it uses atoms and operators like \w, [a-z], and +.
Another reason is that what regular expressions do can be confusing in and of itself. Ordinary programming chores generally translate more or less directly into code. You might think "count from 1 to 10" and write for $i (1..10) { print "$i\n" }. But a regular expression that accomplishes a particular task may not look much like a series of straightforward instructions. You might think "find me a single-quoted string" and wind up with something like /'(?:\\'|.)*?'/.
It's a good idea to try to make regular expressions more readable, especially if you intend to share your programs with others, or if you plan to work on them yourself at a later date. Keeping regular expressions simple is a start, but there are a couple of Perl features you can use to make even complex regular expressions more understandable.
Use /x to add whitespace to regular expressions
Normally, whitespace encountered in a regular expression is significant:
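As a minimal sketch of what "significant" means here (the sample string and variable names are made up for illustration), a space in an ordinary pattern has to match a space in the target string:

```perl
use strict;
use warnings;

# Illustrative sample text; any string containing "b c" would do.
my $text = "ab c";

# Without /x, the space in the pattern must match a literal space.
my $with_space    = $text =~ /b c/ ? 1 : 0;   # matches: "b c" is in the text
my $without_space = $text =~ /bc/  ? 1 : 0;   # fails: there is no "bc"

print "with space: $with_space, without space: $without_space\n";
```

Run as written, this should print "with space: 1, without space: 0".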
The /x flag, which can be applied to both pattern matches and substitutions, causes the regular expression parser to ignore whitespace in the pattern, so long as it isn't preceded by a backslash and isn't inside a character class. It also lets you embed comments, which run from a # to the end of the line:
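Here is one possible illustration. The "name = value" input and the variable names are assumptions chosen for this sketch, not an example from the original text:

```perl
use strict;
use warnings;

# Hypothetical input line of the form "name = value".
my $line = "width = 42";

my ($name, $value) = $line =~ /
    (\w+)        # a name made of word characters
    \s* = \s*    # an equals sign, with optional whitespace around it
    (\d+)        # a run of digits
/x;

print "$name: $value\n";

# Under x, only a bare space is ignored. A backslashed space and a
# space inside a character class still match a literal space.
my $bare    = "a b" =~ /a b/x   ? 1 : 0;   # behaves like the pattern ab
my $escaped = "a b" =~ /a\ b/x  ? 1 : 0;   # backslashed space is literal
my $classed = "a b" =~ /a[ ]b/x ? 1 : 0;   # space in a class is literal

print "bare: $bare, escaped: $escaped, classed: $classed\n";
```

One thing to watch for: a comment inside the pattern must not contain the pattern delimiter, and it is still subject to variable interpolation, so it is safest to keep $ and @ out of such comments.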
This can be especially helpful when a regular expression includes a complex alternation:
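As a sketch, suppose we wanted to accept any one of three date layouts. The layouts, the $date_re name, and the test strings are all invented for illustration. With /x, each alternative can sit on its own line with a comment beside it:

```perl
use strict;
use warnings;

# Hypothetical pattern: accept any one of three date layouts.
# Braces are used as delimiters so that slashes need no escaping.
my $date_re = qr{
    \A
    (?:
        \d{4} - \d{2} - \d{2}                  # ISO style, like 2024-01-31
      | \d{1,2} / \d{1,2} / \d{4}              # slash style, like 1/31/2024
      | \d{1,2} \s+ [A-Z][a-z]{2} \s+ \d{4}    # day, month name, year
    )
    \z
}x;

for my $candidate ("2024-01-31", "1/31/2024", "31 Jan 2024", "Jan 31") {
    my $verdict = $candidate =~ $date_re ? "matches" : "does not match";
    print "$candidate: $verdict\n";
}
```

The first three candidates should match and the last should not. Written on one line without /x, the same alternation would be much harder to scan.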
Break complex regular expressions into pieces
As you saw in Item 15, regular expressions are subject to double-quote interpolation. You can use this feature to write regular expressions that are built up with variables. In some cases, this may make them easier to read:
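A minimal sketch of the idea follows. The variable names and the sample text are assumptions made for illustration; the three subpatterns contain no backslashes:

```perl
use strict;
use warnings;

# Each piece of the pattern lives in its own, descriptively named variable.
my $num   = '[0-9]+';       # a run of digits
my $word  = '[a-zA-Z_]+';   # a run of letters and underscores
my $space = '[ ]+';         # a run of spaces (the class protects them from x)

my $text = "Testing 1 2 3";

# The variables are interpolated into the pattern before it is compiled.
my @tokens = $text =~ /($num | $word | $space)/gxo;

print join(":", @tokens), "\n";
```

With this input the tokens are "Testing", a space, "1", a space, "2", a space, and "3", so the script should print "Testing: :1: :2: :3".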
The pattern this example creates is /([0-9]+|[a-zA-Z_]+|[ ]+)/gxo. We used the /o ("compile once") flag, because there is no need for Perl to compile this regular expression more than once.
Notice that there weren't any backslashes in the example. It's hard to avoid using backslashes in more complex patterns. However, because of the way Perl handles backslashes and character escapes in strings (and regular expressions), backslashes in the strings you interpolate must be doubled to work properly:
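The following sketch repeats the tokenizing idea with backslashed atoms. As before, the variable names and the sample text are assumptions chosen for illustration:

```perl
use strict;
use warnings;

# Each doubled backslash in the source becomes one backslash in the string.
my $num   = '\\d+';   # the string holds \d+
my $word  = '\\w+';   # the string holds \w+
my $space = '\\ +';   # the string holds a backslashed space followed by +

my $text   = "Testing 1 2 3";
my @tokens = $text =~ /($num | $word | $space)/gxo;

print "num piece: $num\n";
print join(":", @tokens), "\n";
```

The first line printed should show the piece with a single backslash, and the second should be "Testing: :1: :2: :3". In a single-quoted string a lone backslash before d would also survive, but doubling is the habit that keeps working in double-quoted strings, where a lone backslash before d would be dropped.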
The pattern this example creates is /(\d+|\w+|\ +)/gxo.
If we want a literal backslash in a regular expression, it has to be backslashed (e.g., /\\/ matches a single backslash). Because backslashes in variables have to be doubled, this can result in some ugly looking strings: '\\\\' to match a backslash, and '\\\\\\w' to match a backslash followed by a \w character. This is not going to make our regular expressions more readable in any obvious way, so when dealing with subpatterns containing backslashes, it's wise to make up some strings in variables to hide this ugliness. Let's rewrite the double-quoted string example from above, this time using some variables:
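What follows is a sketch of such a rewrite, not the original listing. The variable names ($back, $spec_ch, $hex_ch, $oct_ch, $char), the set of escapes handled, and the sample input are all assumptions made for illustration:

```perl
use strict;
use warnings;

# Hide the backslash ugliness in one place.
my $back    = '\\\\';                          # regex text for one backslash
my $spec_ch = $back . '\\W';                   # escaped non-word character
my $hex_ch  = $back . 'x[0-9a-fA-F]{2}';       # hex escape: x and two digits
my $oct_ch  = $back . '[0-3]?[0-7]?[0-7]';     # octal escape: up to three digits
my $char    = '[^"' . $back . ']';             # anything but quote or backslash

# The pattern itself now reads almost like a grammar rule.
my $string_re = qr/
    (
        "
        ( $spec_ch | $hex_ch | $oct_ch | $char )*
        "
    )
/x;

# Illustrative input containing an escaped quote and a hex escape.
my $input = 'say "hi \"there\" \x41" now';
my ($quoted) = $input =~ $string_re;

print "$quoted\n";
```

With this input, $quoted should hold the whole double-quoted string, including its embedded escaped quotes. The pieces are joined with the dot operator so that a bracket following a variable can never be mistaken for an array subscript.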
If you are curious as to exactly what a regular expression built up in this manner looks like, print it out. Here's one way, continuing from the example above:
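One way to do that can be sketched as follows. The pieces are defined again here so the snippet runs on its own; their names and contents are assumptions for illustration:

```perl
use strict;
use warnings;

# Hypothetical subpatterns for a double-quoted string matcher.
my $back    = '\\\\';                          # regex text for one backslash
my $spec_ch = $back . '\\W';                   # escaped non-word character
my $hex_ch  = $back . 'x[0-9a-fA-F]{2}';       # hex escape
my $oct_ch  = $back . '[0-3]?[0-7]?[0-7]';     # octal escape
my $char    = '[^"' . $back . ']';             # ordinary character

# Interpolate the pieces into a plain string instead of a pattern,
# so the text the regex compiler would receive can be inspected.
my $pattern = "( \" ( $spec_ch | $hex_ch | $oct_ch | $char )* \" )";

print "/$pattern/xo;\n";
```

Because the string is interpolated the same way a pattern would be, what gets printed is the text Perl would compile.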
This will print:
/( " ( \\\W | \\x[0-9a-fA-F]{2} | \\[0-3]?[0-7]?[0-7] | [^"\\] )* " )/xo;
This is a fairly straightforward example of using variables to construct regular expressions. See the Hip Owls book for a much more complex example: a regular expression that can parse an RFC822 address.