Item 16: Use regular expression memory.


Although regular expressions are handy for determining whether a string looks like one thing or another, their greatest utility is in helping parse the contents of strings once a match has been found. To break apart strings with regular expressions, you must use regular expression memory.

The memory variables: $1, $2, $3, and so on

Most often, parsing with regular expressions involves the use of the regular expression memory variables $1, $2, $3, and so on. Memory variables are associated with parentheses inside regular expressions. Each pair of parentheses in a regular expression "memorizes" what its contents matched. For example:

    $_ = 'http://www.perl.org/index.html';
    m#^http://([^/]+)(.*)#;    # Memorize hostname and path following http://.

    print "host = $1\n";       # host = www.perl.org
    print "path = $2\n";       # path = /index.html

Only successful matches affect the memory variables. An unsuccessful match leaves the memory variables alone, even if it appears that part of a match might be succeeding:

Continued from above:

    $_ = 'ftp://ftp.uu.net/pub/';    # ftp doesn't match http.
    m#^http://([^/]+)(.*)#;          # Same pattern as above.

    print "host = $1\n";             # Still www.perl.org.
    print "path = $2\n";             # Still /index.html.
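Because a failed match leaves stale values behind, it is usually safest to read the memory variables only after testing that the match succeeded. The following is a minimal sketch of that habit; the helper name split_http_url is made up for illustration and is not part of the original text.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (host, path) for an http URL, or an
# empty list when the pattern does not match.
sub split_http_url {
    my ($url) = @_;
    if ($url =~ m#^http://([^/]+)(.*)#) {
        return ($1, $2);    # safe: set by the match that just succeeded
    }
    return;                 # failed match: $1 and $2 are not consulted
}

for my $url ('http://www.perl.org/index.html', 'ftp://ftp.uu.net/pub/') {
    my ($host, $path) = split_http_url($url);
    if (defined $host) {
        print "$url: host = $host, path = $path\n";
    } else {
        print "$url: not an http URL\n";
    }
}
```

Copying $1 and $2 into named variables right away also protects them from being overwritten by a later successful match.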

When a pair of parentheses matches several different places in a string, the corresponding memory variable contains the last match:

    $_ = 'ftp://ftp.uu.net/pub/systems';
    m#^ftp://([^/]+)(/[^/]*)+#;    # Last fragment of the path goes into $2.

    print "host = $1\n";           # host = ftp.uu.net
    print "fragment = $2\n";       # fragment = /systems, but matched /pub first.

In cases involving nested parentheses, count left parentheses to determine which memory variable a particular set of parentheses refers to:

    $_ = 'ftp://ftp.uu.net/pub/systems';
    # This pattern is similar to the last one, but also collects the whole path.
    m#^ftp://([^/]+)((/[^/]*)+)#;

    print "host = $1\n";           # host = ftp.uu.net
    print "path = $2\n";           # path = /pub/systems
    print "fragment = $3\n";       # fragment = /systems

The "count left parentheses" rule applies to all regular expressions, even ones involving alternation:

    $_ = 'ftp://ftp.uu.net/pub';
    m#^((http)|(ftp)|(file)):#;    # Just grab the first (protocol) portion of a URL.

    print "protocol = $1\n";       # protocol = ftp
    print "http = $2\n";           # http =
    print "ftp = $3\n";            # ftp = ftp
    print "file = $4\n";           # file =

The $+ special variable contains the value of the last non-empty memory:

Continued from above:

    print "\$+ = $+\n";    # $+ = ftp
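$+ is most convenient when a pattern has several alternatives and you only care about whichever one captured something. The sketch below follows the well-known Version/Revision idiom from the perlvar documentation; the helper name release_id and the sample strings are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the identifier that follows either
# "Version:" or "Revision:", whichever alternative matched.
sub release_id {
    my ($text) = @_;
    # Only one of the two groups is set after a match; $+ picks it
    # without having to test $1 and $2 separately.
    return $text =~ /Version: (\S+)|Revision: (\S+)/ ? $+ : undef;
}

print release_id('Version: 5.004'), "\n";
print release_id('Revision: 1.7'), "\n";
```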

The parade of frills continues! Memory variables are automatically localized by each new scope. In a unique twist, the localized variables receive copies of the values from the outer scope; this is in contrast to the usual reinitializing of a localized variable:

    # Take a URL apart in two steps: first, split off the protocol.
    $_ = 'ftp://ftp.uu.net/pub';
    m#^([^:]+)://(.*)#;
    print "\$1, \$2 = $1, $2\n";      # $1, $2 = ftp, ftp.uu.net/pub

    {
      print "\$1, \$2 = $1, $2\n";    # $1, $2 = ftp, ftp.uu.net/pub
      $2 =~ m#([^/]+)(.*)#;           # Now, split into host and path.
      print "\$1, \$2 = $1, $2\n";    # $1, $2 = ftp.uu.net, /pub
    }

    # The old $1 and $2 are back.
    print "\$1, \$2 = $1, $2\n";      # $1, $2 = ftp, ftp.uu.net/pub

The localizing mechanism used is local, not my (see Item 23).

Backreferences

Regular expressions can make use of the contents of memories via backreferences. The atoms \1, \2, \3, and so on match the contents of the corresponding memories. An obvious (but not necessarily useful) application of backreferences is solving simple word puzzles:

    /(\w)\1/;          # Matches doubled word char: aa, 11, __.
    /(\w)\1+/;         # Two or more: aaa, bb, 222222.
    /((\w)\2){2,}/;    # Consecutive pairs: aabb, 22__66 ...
                       # remember "count left parentheses" rule.

    /([aeiou]).*\1.*\1.*\1/;       # Same vowel four times.
    /([aeiou])(.*\1){3}/;          # Another way.
    /([aeiou]).*?\1.*?\1.*?\1/;    # Non-greedy version matching first four (see Item 17).

This kind of thing is always good for 10 minutes of fun on a really slow day. Just sit at your Unix box and type things like:

    % perl -ne 'print if /([aeiou])(.*\1){3}/' /usr/dict/words

I get 106 words from this one, including "tarantara." Hmm.

Backreferences are a powerful feature, but you may not find yourself using them all that often. Sometimes they are handy for dealing with delimiters in a simplistic way:

    /(['"]).*\1/;            # 'stuff' or "stuff", greedy.
    /(['"]).*?\1/;           # Non-greedy version (see Item 17).
    /(['"])(\\.|.)*?\1/;     # Handles escapes: \', \".

Unfortunately, this approach breaks down quickly: you can't use it to match parentheses (even without worrying about nesting), and there are faster ways to deal with embedded escapes.
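As a small illustration of the delimiter technique, here is a sketch that pulls out the contents of the first quoted string, whichever quote character opens it. The helper name first_quoted is hypothetical, and escaped quotes are deliberately not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the contents of the first quoted string
# in $text. \1 forces the closing quote to be the same character as the
# opening one, so the other kind of quote may appear inside.
sub first_quoted {
    my ($text) = @_;
    return $text =~ /(['"])(.*?)\1/ ? $2 : undef;
}

print first_quoted(q{say "it's fine" now}), "\n";    # it's fine
print first_quoted(q{say 'a "b" c' now}), "\n";      # a "b" c
```

Note that the interesting text lands in $2, because the quote character itself occupies $1.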

The match variables: $`, $&, $'

In addition to memory variables like $1, $2, and $3, there are three special match variables that refer to the match and the string from which it came. $& refers to the portion of the string that the entire pattern matched, $` refers to the portion of the string preceding the match, and $' refers to the portion of the string following the match. As with memory variables, they are set after each successful pattern match.

Match variables can help with certain types of substitutions in which a replacement string has to be computed:

Go through the contents of OLD a line at a time, replacing some one-line HTML comments.

    while (<OLD>) {
      # Extract info from comment and check it out.
      while (/<!--\s*(.*?)\s*-->/g) {
        # Replace comment.
        $_ = $` . new_html($1) . $'
          if ok_to_replace($1);
      }
      print NEW $_;
    }

Some people complain that using match variables makes Perl programs run slower. This is true. There is some extra work involved in maintaining the values of the match variables, and once any of the match variables appears in a program, Perl maintains them for every regular expression match in the program. If you are concerned about speed, you may want to rewrite code that uses match variables. You can generally rephrase such code as substitutions that use memory variables. In the case above, you could do the obvious (but incorrect):

    # Use substitution rather than match variables for replacement.
    # However, /g won't work; thus this is broken for lines that
    # contain more than one comment.
    while (<OLD>) {
      while (/<!--\s*(.*?)\s*-->/) {
        s/<!--\s*(.*)\s*-->/new_html($1)/e
          if ok_to_replace($1);
      }
      print NEW $_;
    }

Or, a correct but slightly more involved alternative:

    # Use s///eg for replacement (looks better using braces as delimiters).
    while (<OLD>) {
      s{(<!--\s*(.*?)\s*-->)}{
        ok_to_replace($2) ?
          new_html($2) : $1;
      }eg;
      print NEW $_;
    }

In most cases, though, I would recommend that you write whatever makes your code clearer, including using match variables when appropriate. Worry about speed after everything works and you've made your deadline (see Item 22).

The localizing behavior of match variables is the same as that of memory variables.

Memory in substitutions

Memory and match variables are often used in substitutions. Uses of $1, $2, $&, and so on within the replacement string of a substitution refer to the memories from the match part, not an earlier statement (hopefully, this is obvious):

    s/(\S+)\s+(\S+)/$2 $1/;    # Swap two words.

    # Here is an approach to HTML entity escaping.
    %ent = (
      '&' => 'amp', '<' => 'lt',
      '>' => 'gt'
    );
    $html =~ s/([&<>])/&$ent{$1};/g;    # a&b becomes a&amp;b

    $newsgroup =~ s/(\w)\w*/$1/g;       # comp.sys.unix becomes c.s.u.

Some substitutions using memory variables can be accomplished without them if you look at what to throw away, rather than what to keep.

    s/^\s*(.*)/$1/;    # Eliminate leading whitespace, hard way.
    s/^\s+//;          # Much better!

    $_ = "FOO=bar BLETCH=baz";
    s/(FOO=\S+)|\w+=\S+/$1/g;       # Throw away assignments except FOO=.

    s/(this|that)|(\w)/$1\U$2/g;    # Uppercase all words except this and that.

You can use the /e (eval) option to help solve some tricky problems:

    # Replace each nonexistent foo.txt with <foo.txt not found>.
    s/(\S+\.txt)\b/-e $1 ? $1 : "<$1 not found>"/ge;

Substitutions using /e can sometimes be more legibly written using matching delimiters and possibly the /x option (see Item 21):

    # Same as above, written with /x option to ignore whitespace
    # (including comments) in pattern.
    s{
      (\S+\.txt)\b   # ending in .txt?
    }{
      -e $1 ? $1 : "<$1 not found>"
    }gex;
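The same s{...}{...}gex shape works for any replacement that has to be computed from $1. As a sketch under assumed names, here is a tiny placeholder expander: the %name% placeholder syntax, the %value table, and the expand helper are all invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical lookup table, for illustration only.
my %value = (user => 'joseph', host => 'perl.org');

# Replace each %name% with its value; leave unknown names untouched.
sub expand {
    my ($text) = @_;
    $text =~ s{
        % (\w+) %        # a name between percent signs
    }{
        exists $value{$1} ? $value{$1} : "%$1%"
    }gex;
    return $text;
}

print expand('mail %user%@%host% about %topic%'), "\n";
# mail joseph@perl.org about %topic%
```

The replacement side is ordinary Perl code because of /e, so $1 can be used as a hash key or passed to a function before its result is spliced into the string.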

Matching in a list context

In a list context, the match operator m// returns a list of values corresponding to the contents of the memory variables. If the match is unsuccessful, the match operator returns an empty list. This doesn't change the behavior of the memory variables: $1, $2, $3, and so on are still set as usual.

Matching in a list context is one of the most useful features of the match operator. It allows you to scan and split apart a string in a single step:

    # Parse an RFC822-like header line.
    ($name, $value) = /^([^:\s]*):\s+(.*)/;

    # Get the subject, minus leading re:.
    ($bs, $subject) =
      /^subject:\s+(re:\s*)?(.*)/i;

    # Or, instead of a list assignment, a literal slice.
    $subject =
      (/^subject:\s+(re:\s*)?(.*)/i)[1];

    # Parse a uuencoded file's begin line.
    ($mode, $fn) = /begin\s+(\d+)\s+(\S+)/i;

Using a match inside a map is even more succinct. This is one of my favorite ultra-high-level constructs:

    # Find the date of a message in not very much Perl.
    ($date) =
      map { /^Date:\s+(.*)/ } @msg_hdr;

    # Produce a list of the named tcp stream protocols by parsing
    # inetd.conf or something similar.
    @protos =
      map { /^(\w+)\s+stream\s+tcp/ } <>;
    print "protocols: @protos\n";

Note that it turns out to be extremely handy that a failed match returns an empty list.
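Two consequences of the empty-list return are worth spelling out. A list assignment used as a condition is false when the match fails, and lines that fail to match simply vanish from a map. The sketch below shows both; the helper names parse_header and headers_to_hash and the sample header lines are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one header line, or return nothing.
sub parse_header {
    my ($line) = @_;
    # A list assignment in a condition yields the number of values on
    # the right-hand side: 2 on success, 0 when the match fails.
    if (my ($name, $value) = $line =~ /^([^:\s]*):\s+(.*)/) {
        return { name => $name, value => $value };
    }
    return;
}

# Hypothetical helper: build a name => value list from many lines.
# Non-matching lines contribute an empty list and so drop out.
sub headers_to_hash {
    my @lines = @_;
    return map { /^([^:\s]*):\s+(.*)/ } @lines;
}

my %field = headers_to_hash(
    "From: joseph", "not a header", "Subject: regex memory");
print "$_ => $field{$_}\n" for sort keys %field;
```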

A match with the /g option in a list context returns all the memories for each successful match:

    # Prints editor: the last two characters of each word.
    print "fred quit door" =~ m/(..)\b/g;
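When the pattern has two sets of parentheses, a list-context /g match returns the memories as a flat list of pairs, which is exactly the shape needed to initialize a hash. A minimal sketch, assuming whitespace-separated NAME=value input and a made-up helper name parse_settings:

```perl
use strict;
use warnings;

# Hypothetical helper: turn "A=1 B=2" into (A => 1, B => 2).
sub parse_settings {
    my ($line) = @_;
    # /g in list context returns ($1, $2) for every successful match:
    # (name1, value1, name2, value2, ...).
    my %setting = $line =~ /(\w+)=(\S+)/g;
    return %setting;
}

my %s = parse_settings("FOO=bar BLETCH=baz QUUX=42");
print "$_ => $s{$_}\n" for sort keys %s;
```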

Memory-free parentheses

Parentheses in Perl regular expressions serve two different purposes: grouping and memory. Although this is usually convenient, or at least irrelevant, it can get in the way at times. Here's an example we just saw:

    # Get the subject, minus leading re:.
    ($bs, $subject) =
      /^subject:\s+(re:\s*)?(.*)/i;

We need the first set of parentheses for grouping (so the ? will work right), but they get in the way memory-wise. What we would like to have is the ability to group without memory. Perl 5 introduced a feature for this specific purpose. Memory-free parentheses (?:...) group like parentheses, but they don't create backreferences or memory variables:

    # Get the subject, no bs.
    ($subject) =
      /^subject:\s+(?:re:\s*)?(.*)/i;

Memory-free parentheses are also handy in the match-inside-map construct (see above), and for avoiding delimiter retention mode in split (see Item 19). In some cases they also may be noticeably faster than ordinary parentheses (see Item 22). On the other hand, memory-free parentheses are a pretty severe impediment to readability and probably are best avoided unless needed.
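The split case is easy to see side by side. When the split pattern contains capturing parentheses, the text they matched is returned along with the fields; memory-free parentheses give the same grouping without that effect. This is a small sketch with made-up helper names and sample input:

```perl
use strict;
use warnings;

# Capturing parentheses: split also returns what the delimiter matched.
sub split_keeping_delims {
    my ($s) = @_;
    return split /\s*(,|;)\s*/, $s;
}

# Memory-free parentheses: same grouping, delimiters discarded.
sub split_dropping_delims {
    my ($s) = @_;
    return split /\s*(?:,|;)\s*/, $s;
}

print join('|', split_keeping_delims("a, b;c")), "\n";     # a|,|b|;|c
print join('|', split_dropping_delims("a, b;c")), "\n";    # a|b|c
```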

Tokenizing with regular expressions

Tokenizing or "lexing" a string (dividing it up into lexical elements like whitespace, numbers, identifiers, operators, and so on) offers an interesting application for regular expression memory.

If you have written or tried to write computer language parsers in Perl, you may have discovered that the task can seem downright hard at times. Perl seems to be missing some features that would make things easier. The problem is that when you are tokenizing a string, what you want is to find out which of several possible patterns matches the beginning of a string (or at a particular point in its middle). On the other hand, what Perl is good at is finding out where in a string a single pattern matches. The two don't map onto one another very well.

Let's take the example of parsing simple arithmetic expressions containing numbers, parentheses, and the operators +, -, *, and /. (Let's ignore whitespace, which we could have substituted or tr-ed out beforehand.) One way to do this might be:

    # Tokenize contents of $_ into array @tok.
    while ($_) {
      if (/^(\d+)/) {
        push @tok, 'num', $1;
      } elsif (/^([+\-\/*()])/) {
        push @tok, 'punct', $1;
      } elsif (/^([\d\D])/) {
        die "invalid char $1 in input";
      }
      # Chop off what we recognized and go back for more.
      $_ = substr($_, length $1);
    }

This turns out to be moderately efficient, even if it looks ugly. However, a tokenizer like this one will slow down considerably when fed long strings because of the substr operation at the end. You might think of keeping track of the current starting position in a variable named $pos and then doing something like:

 if (substr($_, $pos) =~ /^(\d+)/) { 

However, this do-it-yourself technique probably won't be much faster and may be slower on short strings.

One approach that works reasonably well, and that is not affected unduly by the length of the text to be lexed, relies on the behavior of the match operator's /g option in a scalar context; we'll call this a "scalar m//g match." Each time a scalar m//g match is executed, the regular expression engine starts looking for a match at the current "match position," generally after the end of the preceding match, analogous to the $pos variable mentioned above. In fact, the current match position can be accessed (and changed) through Perl's pos operator. Applying a scalar m//g match allows you to use a single regular expression, and it frees you from having to keep track of the current position explicitly:

    # Use a match with the /g option. The /x option is also used to
    # improve readability (see Item 21).
    while (/
      (\d+) |          # number
      ([+\-\/*()]) |   # punctuation
      ([\d\D])         # something else
    /xg) {
      # Examine $1, $2, $3 to see what was matched.
      if ($1 ne "") {
        push @tok, 'num', $1;
      } elsif ($2 ne "") {
        push @tok, 'punct', $2;
      } else {
        die "invalid char $3 in input";
      }
    }

The most recent versions of Perl support a /c option for matches, which modifies the way scalar m//g operates. Normally, when a scalar m//g match fails, the match position is reset, and the next scalar m//g will start matching at the beginning of the target string. The /c option causes the match position to be retained following an unsuccessful match. This, combined with the \G anchor, which forces a match beginning at the last match position, allows you to write more straightforward tokenizers:

    {                                      # A naked block for looping.
      if (/\G(\d+)/gc) {                   # Is it a number?
        push @tok, 'num', $1;
      } elsif (/\G([+\-\/*()])/gc) {       # Is it punctuation?
        push @tok, 'punct', $1;
      } elsif (/\G([\d\D])/gc) {           # It's something else.
        die "invalid char $1 in input";
      } else {                             # Out of string?
        last;                              # We're done.
      }
      redo;                                # Otherwise, loop.
    }
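The \G and /gc technique also packages up neatly as a subroutine that works on a private copy of its input instead of $_. The following is a sketch only: the name tokenize and the choice to return a flat list of type/text pairs are assumptions, and an ordinary while (1) loop stands in for the naked block with redo.

```perl
use strict;
use warnings;

# Hypothetical wrapper: returns a flat list of (type, text) pairs.
sub tokenize {
    my ($input) = @_;
    my @tok;
    while (1) {
        # Each test starts at the current match position of $input
        # because of \G; /c keeps that position when a test fails.
        if ($input =~ /\G(\d+)/gc) {
            push @tok, 'num', $1;
        } elsif ($input =~ /\G([+\-\/*()])/gc) {
            push @tok, 'punct', $1;
        } elsif ($input =~ /\G([\d\D])/gc) {
            die "invalid char $1 in input\n";
        } else {
            last;    # nothing left to match: end of string
        }
    }
    return @tok;
}

print join(' ', tokenize('12*(3+45)')), "\n";
# num 12 punct * punct ( num 3 punct + num 45 punct )
```

Because the match position belongs to the scalar being matched, each call starts at the beginning of its own copy of the string.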

Although it isn't possible to write a single regular expression that matches nested delimiters, with scalar m//gc you can come fairly close.

Find nested delimiters using scalar m//gc.

Here is an approach to matching nested braces. {qw({ 1 } -1)} is an anonymous hash ref; it could have been written less succinctly as {('{' => 1, '}' => -1)}.

    $_ = " Here are { nested {} { braces } }!";    # Input goes into $_.

    {
      my $c;                                       # $c counts braces.
      while (/([{}])/gc) {                         # Find braces and count them
        last unless ($c += {qw({ 1 } -1)}->{$1}) > 0    # until count is 0.
      };
    }
    print substr substr($_, 0, pos()), index($_, "{");    # Print found string.
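For comparison, the same brace-counting idea can be written as a subroutine with a named lookup table and an explicit starting position. This is a sketch under assumed names (first_brace_group is hypothetical); it returns nothing when the braces never balance.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first balanced {...} group in $text,
# or nothing if there is no opening brace or the braces never balance.
sub first_brace_group {
    my ($text) = @_;
    my $start = index($text, '{');
    return if $start < 0;

    my %delta = ('{' => 1, '}' => -1);
    my $depth = 0;
    pos($text) = $start;               # begin scanning at the first brace
    while ($text =~ /([{}])/gc) {      # scalar m//gc: one brace per pass
        $depth += $delta{$1};
        return substr($text, $start, pos($text) - $start)
            if $depth == 0;
    }
    return;                            # ran out of string while still nested
}

print first_brace_group(" Here are { nested {} { braces } }!"), "\n";
# { nested {} { braces } }
```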



Effective Perl Programming: Writing Better Programs with Perl
ISBN: 0201419750
Year: 1996
Pages: 116
