Flylib.com

Books Software

 
 
 

Recipe 6.17 Matching Nested Patterns

Recipe 6.17 Matching Nested Patterns

6.17.1 Problem

You want to match a nested set of enclosing delimiters, such as the arguments to a function call.

6.17.2 Solution

Use match-time pattern interpolation, recursively:

my $np;
$np = qr{
           \(
           (?:
              (?> [^( )]+ )    # Non-capture group w/o backtracking
            
              (??{ $np })     # Group with matching parens
           )*
           \)
        }x;

Or use the Text::Balanced module's extract_bracketed function.

6.17.3 Discussion

The $(??{ CODE }) construct runs the code and interpolates the string that the code returns right back into the pattern. A simple, non-recursive example that matches palindromes demonstrates this:

if ($word =~ /^(\w+)\w?(??{reverse })$/ ) {
    print "$word is a palindrome.\n";
}

Consider a word like "reviver", which this pattern correctly reports as a palindrome. The $1 variable contains " rev " partway through the match. The optional word character following catches the " i ". Then the code reverse $1 runs and produces " ver ", and that result is interpolated into the pattern.

For matching something balanced, you need to recurse, which is a bit tricker. A compiled pattern that uses (??{ CODE }) can refer to itself. The pattern given in the Solution matches a set of nested parentheses, however deep they may go. Given the value of $np in that pattern, you could use it like this to match a function call:

$text = "myfunfun(1,(2*(3+4)),5)";
$funpat = qr/\w+$np/;   # $np as above
$text =~ /^$funpat$/;   # Matches!

You'll find many CPAN modules that help with matching (parsing) nested strings. The Regexp::Common module supplies canned patterns that match many of the tricker strings. For example:

use Regexp::Common;
$text = "myfunfun(1,(2*(3+4)),5)";
if ($text =~ /(\w+\s*$RE{balanced}{-parens=>'( )'})/o) {
  print "Got function call: \n";
}

Other patterns provided by that module match numbers in various notations and quote-delimited strings:

$RE{num}{int}
$RE{num}{real}
$RE{num}{real}{'-base=2'}{'-sep=,'}{'-group=3'}
$RE{quoted}
$RE{delimited}{-delim=>'/'}

The standard (as of v5.8) Text::Balanced module provides a general solution to this problem.

use Text::Balanced qw/extract_bracketed/;
$text = "myfunfun(1,(2*(3+4)),5)";
if (($before, $found, $after)  = extract_bracketed($text, "(")) {
    print "answer is $found\n";
} else {
    print "FAILED\n";
}

6.17.4 See Also

The section on "Match-time pattern interpolation" in Chapter 5 of Programming Perl ; the documentation for the Regexp::Common CPAN module and the standard Text::Balanced module

Recipe 6.18 Expressing AND, OR, and NOT in a Single Pattern

6.18.1 Problem

You have an existing program that accepts a pattern as an argument or as input. It doesn't allow you to add extra logic, like case-insensitive options, ANDs, or NOTs. So you need to write a single pattern that matches either of two different patterns (the "or" case) or both of two patterns (the "and" case), or that reverses the sense of the match ("not").

This situation arises often in configuration files, web forms, or command-line arguments. Imagine there's a program that does this:

chomp($pattern = <CONFIG_FH>);
if ( $data =~ /$pattern/ ) { ..... }

As the maintainer of CONFIG_FH , you need to convey Booleans through to the program using one configuration parameter.

6.18.2 Solution

True if either /ALPHA/ or /BETA/ matches, like /ALPHA/ /BETA/ :

/ALPHABETA/
/(?:ALPHA)(?:BETA)/  # works no matter what in both

True if both /ALPHA/ and /BETA/ match, but may overlap, meaning " BETALPHA " should be okay, like /ALPHA/ && /BETA/ :

/^(?=.*ALPHA)BETA/s

True if both /ALPHA/ and /BETA/ match, but may not overlap, meaning that " BETALPHA " should fail:

/ALPHA.*BETABETA.*ALPHA/s

True if pattern /PAT/ does not match, like $var !~ /PAT/ :

/^(?:(?!PAT).)*$/s

True if pattern BAD does not match, but pattern GOOD does:

/(?=^(?:(?!BAD).)*$)GOOD/s

(You can't actually count on being able to place the /s modifier there after the trailing slash, but we'll show how to include it in the pattern itself at the end of the Discussion.)

6.18.3 Discussion

When in a normal program you want to know whether something doesn't match, use one of:

if (!($string =~ /pattern/)) { something( ) }   # ugly
    if (  $string !~ /pattern/)  { something( ) }   # preferred
unless (  $string =~ /pattern/)  { something( ) }   # sometimes clearer

To see whether both patterns match, use:

if ($string =~ /pat1/ && $string =~ /pat2/ ) { something( ) }

To see whether either of two patterns matches:

if ($string =~ /pat1/  $string =~ /pat2/ ) { something( ) }

Instead of trying to do it all within a single pattern, it's often more efficient and clearer to use Perl's normal Boolean connectives to combine regular expressions. However, imagine a trivially short minigrep program that reads its single pattern as an argument, as shown in Example 6-9.

Example 6-9. minigrep
#!/usr/bin/perl
  # minigrep - trivial grep
  $pat = shift;
  while (<>) {
      print if /$pat/o;
  }

To tell minigrep that some pattern must not match, or that it has to match both subpatterns in any order, you're at an impasse. The program isn't built to accept multiple patterns. How can you do it using one pattern? This need comes up in programs reading patterns from configuration files.

The OR case is pretty easy, since the metacharacter provides for alternation . The AND and NOT cases, however, are more complex.

For AND, you have to distinguish between overlapping and non-overlapping needs. If, for example, you want to see whether a string matches both " bell " and " lab " and allow overlapping, the word " labelled " should be matched. But if you don't want to count overlaps, it shouldn't be matched. The overlapping case uses a lookahead assertion:

"labelled" =~ /^(?=.*bell)lab/s

Remember: in a normal program, you don't have to go through these contortions. Simply say:

$string =~ /bell/ && $string =~ /lab/

To unravel this, we'll spell it out using /x and comments. Here's the long version:

if ($murray_hill =~ m{
             ^              # start of string
            (?=             # zero-width lookahead
                .*          # any amount of intervening stuff
                bell        # the desired bell string
            )               # rewind, since we were only looking
            lab             # and the lab part
         }sx )              # /s means . can match newline
{
    print "Looks like Bell Labs might be in Murray Hill!\n";
}

We didn't use .*? to end early, because minimal matching is more expensive than maximal matching. It's more efficient to use .* over .*? , given random input where the occurrence of matches at the front or the end of the string is completely unpredictable. Of course, sometimes choosing between .* and .*? may depend on correctness rather than efficiency, but not here.

To handle the non-overlapping case, you need two parts separated by an OR. The first branch is THIS followed by THAT; the second is the other way around:

"labelled" =~ /(?:^.*bell.*lab)(?:^.*lab.*bell)/

or in long form:

$brand = "labelled";
if ($brand =~ m{
        (?:                 # non-capturing grouper
              bell          # look for a bell
              .*?           # followed by any amount of anything
              lab           # look for a lab
          )                 # end grouper
                           # otherwise, try the other direction
        (?:                 # non-capturing grouper
              lab           # look for a lab
              .*?           # followed by any amount of anything
              bell          # followed by a bell
          )                 # end grouper
    }sx )                   # /s means . can match newline
{
    print "Our brand has bell and lab separate.\n";
}

Neither of those patterns matches the test data of " labelled ", since there " bell " and " lab " do overlap.

These patterns aren't necessarily efficient. $murray_hill =~ /bell/ && $murray_hill =~ /lab/ scans the string at most twice, but the pattern-matching engine's only option is to try to find a " lab " for each occurrence of " bell " with (?=^.*?bell)(?=^.*?lab) , leading to quadratic worst-case running times.

If you followed those examples, the NOT case should be a breeze . The general form looks like this:

$map =~ /^(?:(?!waldo).)*$/s

Spelled out in long form, this yields:

if ($map =~ m{
        ^                   # start of string
        (?:                 # clustering grouper
            (?!             # look ahead negation
                waldo       # is he ahead of us now?
            )               # is so, the negation failed
            .               # any character (cuzza /s)
        ) *                 # repeat that grouping 0 or more
        $                   # through the end of the string
    }sx )                   # /s means . can match newline
{
    print "There's no waldo here!\n";
}

How would you combine AND, OR, and NOT? It's not a pretty picture, and in a regular program, you'd almost never do this. But you have little choice when you're reading from a config file or pulling in arguments from the command line, because you specify only one pattern. You just have to combine what we've learned so far. Carefully.

Let's say you wanted to run the Unix w program and find out whether user tchrist were logged on anywhere but a terminal whose name began with ttyp ; that is, tchrist must match, but ttyp must not.

Here's sample input from w :



7:15am  up 206 days, 13:30,  4 users,  load average: 1.04, 1.07, 1.04




USER     TTY      FROM              LOGIN@  IDLE   JCPU   PCPU  WHAT




tchrist  tty1                       5:16pm 36days 24:43   0.03s  xinit




tchrist  tty2                       5:19pm  6days  0.43s  0.43s  -tcsh




tchrist  ttyp0    chthon            7:58am  3days 23.44s  0.44s  -tcsh




gnat     ttyS4    coprolith         2:01pm 13:36m  0.30s  0.30s  -tcsh


Here's how to do that using the minigrep program previously outlined or with the tcgrep program from the end of this chapter:

% w  minigrep '(?!.*ttyp)tchrist'

Decoding that pattern:

m{
    (?!                     # zero-width look-ahead assertion
        .*                  # any amount of anything (faster than .*?)
        ttyp                # the string you don't want to find
    )                       # end look-ahead negation; rewind to start
    tchrist                 # now try to find Tom
}x

Of course, this example is contrived: any sane person would call the standard grep program twice, once with a -v option to select only non-matches.

% w  grep tchrist  grep -v ttyp

The point is that Boolean conjunctions and negations can be coded up in one single pattern. You should comment this kind of thing, though, having pity on those who come after you -before they do.

One last thing: how would you embed that /s in a pattern passed to a program from the command line? The same way as you would a /i modifier: by using (?i) in the pattern. The /s and /m modifiers can be painlessly included in a pattern as well, using (?s) or (?m) . These can even cluster, as in (?smi) . That would make these two reasonably interchangeable:

% grep -i 'pattern' files
% minigrep '(?i)pattern' files

When you turn on a modifier that way, it remains on for the entire pattern. An alternative notation restricts the scope of the modifier. Use a clustering parenthesis set, (?:...) , and place the modifiers between the question mark and the colon . Printing out a qr// quoted regex demonstrates how to do this:

% perl -le 'print qr/pattern/i'


(?i-xsm:pattern)


Modifiers placed before a minus are enabled for just that pattern; those placed after the minus are disabled for that pattern.

6.18.4 See Also

Lookahead assertions are shown in the "Regular Expressions" section of perlre (1), and in the "Lookaround Assertions" section of Chapter 5 of Programming Perl ; your system's grep (1) and w (1) manpages; we talk about configuration files in Recipe 8.16