Section 12.17. Piecewise Matching



Tokenize input using the /gc flag.

The typical approach to breaking an input string into individual tokens is to "nibble" at it, repeatedly biting off the start of the input string with successive substitutions:

    while (length $input > 0) {
        if ($input =~ s{\A ($KEYWORD)}{}xms) {
            my $keyword = $1;
            push @tokens, start_cmd($keyword);
        }
        elsif ($input =~ s{\A ($IDENT)}{}xms) {
            my $ident = $1;
            push @tokens, make_ident($ident);
        }
        elsif ($input =~ s{\A ($BLOCK)}{}xms) {
            my $block = $1;
            push @tokens, make_block($block);
        }
        else {
            my ($context) = $input =~ m/ \A ([^\n]*) /xms;
            croak "Error near: $context";
        }
    }

But this approach requires a modification to the $input string on every successful match, which makes it expensive to start with, and then causes it to scale badly as well. Nibbling away at strings is slow and gets slower as the strings get bigger.

In Perl 5.004 and later, there's a much better way to use regexes for tokenizing an input: you can just "walk" the string, using the /gc flag. The /gc flag tells a regex to track where each successful match finishes matching. You can then access that "end-of-the-last-match" position via the built-in pos() function. There is also a \G metacharacter, which is a positional anchor, just like \A is. However, whereas \A tells the regex to match only at the start of the string, \G tells it to match only where the previous successful /gc match finished. If no previous /gc match was successful, \G acts like \A and matches only at the start of the string.
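For instance, here is a minimal sketch (using a made-up two-word string) of how /gc, pos(), and \G interact:

```perl
use strict;
use warnings;

# A small made-up string to illustrate /gc and pos()...
my $str = 'foo bar';

# Match the first word; /gc records where the match finished...
$str =~ m{ \A (\w+) }gcxms;
print 'pos is now ', pos($str), "\n";    # prints 3

# \G anchors the next match exactly where the previous one ended...
if ($str =~ m{ \G [ ] (\w+) }gcxms) {
    print "next token: $1\n";            # prints 'bar'
}
```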

All of which means that, instead of using a regex substitution to lop each token off the start of the string (s{\A...}{}), you can simply use a regex match to start looking for the next token at the point where the previous token match finished (m{\G...}gc).

So the previous tokenizer could be rewritten more efficiently as:

      
    # Reset the matching position of $input to the beginning of the string...
    pos $input = 0;

    # ...and continue until the matching position is past the last character...
    while (pos $input < length $input) {
        if ($input =~ m{ \G ($KEYWORD) }gcxms) {
            my $keyword = $1;
            push @tokens, start_cmd($keyword);
        }
        elsif ($input =~ m{ \G ($IDENT) }gcxms) {
            my $ident = $1;
            push @tokens, make_ident($ident);
        }
        elsif ($input =~ m{ \G ($BLOCK) }gcxms) {
            my $block = $1;
            push @tokens, make_block($block);
        }
        else {
            $input =~ m/ \G ([^\n]*) /gcxms;
            my $context = $1;
            croak "Error near: $context";
        }
    }
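To see the walking tokenizer in action, here is a self-contained sketch. The $KEYWORD and $IDENT patterns and the token constructors are illustrative assumptions (the real ones are defined elsewhere in the original program), and a whitespace-skipping branch is added so the sample input parses:

```perl
use strict;
use warnings;
use Carp;

# Assumed, simplified token patterns (the originals are defined elsewhere)...
my $KEYWORD = qr{ if | while | for }xms;
my $IDENT   = qr{ [A-Za-z_]\w* }xms;

my $input = 'if count';
my @tokens;

pos $input = 0;
while (pos $input < length $input) {
    if ($input =~ m{ \G ($KEYWORD) \b }gcxms) {
        push @tokens, [ keyword => $1 ];
    }
    elsif ($input =~ m{ \G ($IDENT) }gcxms) {
        push @tokens, [ ident => $1 ];
    }
    elsif ($input =~ m{ \G \s+ }gcxms) {
        # Skip whitespace between tokens (an addition for this example)...
    }
    else {
        $input =~ m/ \G ([^\n]*) /gcxms;
        croak "Error near: $1";
    }
}
# @tokens now holds [keyword => 'if'] and [ident => 'count']
```

Note that a failed /gc match does not reset pos(), which is exactly what lets each elsif branch retry from the same position.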

Of course, because this style of parsing inevitably spawns a series of cascaded if statements that all feed the same @tokens array, it's even better practice to use the ternary operator and create a "parsing table" (see "Tabular Ternaries" in Chapter 6):

    while (pos $input < length $input) {
        push @tokens, (
            # For token type...                    # Build token...
              $input =~ m{ \G ($KEYWORD) }gcxms  ?  start_cmd($1)
            : $input =~ m{ \G ($IDENT)   }gcxms  ?  make_ident($1)
            : $input =~ m{ \G ($BLOCK)   }gcxms  ?  make_block($1)
            : $input =~ m{ \G ([^\n]*)   }gcxms  ?  croak "Error near: $1"
            :                                       die 'Internal error'
        );
    }
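The tabular form can likewise be run standalone. This sketch again uses assumed patterns and trivial stand-in constructors, plus an added whitespace-skipping step, so it is an illustration of the table layout rather than the book's full grammar:

```perl
use strict;
use warnings;
use Carp;

# Assumed, simplified patterns and constructors for illustration only...
my $KEYWORD = qr{ if | while }xms;
my $IDENT   = qr{ [A-Za-z_]\w* }xms;
sub start_cmd  { return { type => 'keyword', text => shift } }
sub make_ident { return { type => 'ident',   text => shift } }

my $input = 'while x';
my @tokens;

pos $input = 0;
while (pos $input < length $input) {
    next if $input =~ m{ \G \s+ }gcxms;   # skip whitespace between tokens
    push @tokens, (
          $input =~ m{ \G ($KEYWORD) \b }gcxms ? start_cmd($1)
        : $input =~ m{ \G ($IDENT)     }gcxms ? make_ident($1)
        :   croak 'Error near: ' . substr($input, pos $input)
    );
}
# @tokens now holds the keyword 'while' and the ident 'x'
```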

Note that these examples don't use direct list capturing to rename the capture variables (as recommended in the preceding guideline). Instead they pass $1 into a token-constructing subroutine immediately after the match. That's because a list capture would cause the regex to match in list context, which would force the /g component of the flag to incorrectly match every occurrence of the pattern, rather than just the next one.
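The difference in /g behaviour between the two contexts can be demonstrated directly (the string here is made up for illustration):

```perl
use strict;
use warnings;

my $str = 'aa bb cc';

# In scalar (boolean) context, /gc advances one match at a time...
$str =~ m{ (\w+) }gcxms;
# pos($str) is now 2: only 'aa' was consumed

# In list context, /g instead returns every match in one pass...
my @all = 'aa bb cc' =~ m{ (\w+) }gxms;
# @all is ('aa', 'bb', 'cc')
```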



Perl Best Practices
ISBN: 0596001738
Year: 2004
Pages: 350
Authors: Damian Conway
