Section 2.2. Parse::Yapp | Advanced Perl Programming

2.2. Parse::Yapp

If you're more familiar with tools like yacc, you may prefer to use François Désarménien's Parse::Yapp module. This is more or less a straight port of yacc to Perl.

yacc

yacc, Yet Another Compiler Compiler, is a tool for C programmers to generate a parser from a grammar specification. The grammar specification is much the same as we've seen in our investigation of Parse::RecDescent, but yacc produces bottom-up parsers.

For instance, let's use Parse::Yapp to implement the calculator in Chapter 3 of lex & yacc (O'Reilly). This is a very simple calculator with a symbol table, so you can say things like this:

     a = 25
     b = 30
     a + b
     55

Here's their grammar:

     %{
     double vbltable[26];
     %}
     %union {
         double dval;
         int vblno;
     }
     %token <vblno> NAME
     %token <dval> NUMBER
     %left '-' '+'
     %left '*' '/'
     %nonassoc UMINUS
     %type <dval> expression
     %%
     statement_list:    statement '\n'
         |    statement_list statement '\n'
         ;
     statement:    NAME '=' expression    { vbltable[$1] = $3; }
         |    expression        { printf("= %g\n", $1); }
         ;
     expression:    expression '+' expression { $$ = $1 + $3; }
         |    expression '-' expression { $$ = $1 - $3; }
         |    expression '*' expression { $$ = $1 * $3; }
         |    expression '/' expression
                     {    if($3 == 0.0)
                             yyerror("divide by zero");
                         else
                             $$ = $1 / $3;
                     }
         |    '-' expression %prec UMINUS    { $$ = -$2; }
         |    '(' expression ')'    { $$ = $2; }
         |    NUMBER
         |    NAME            { $$ = vbltable[$1]; }
         ;
     %%

Converting the grammar is very straightforward; the only serious change we need to consider is how to implement the symbol table. We know that Perl's internal symbol tables are just hashes, so that's good enough for us. The other changes are just cosmetic, and we end up with a Parse::Yapp grammar like this:

     %{ my %symtab; %}
     %token NAME
     %token NUMBER
     %left '-' '+'
     %left '*' '/'
     %nonassoc UMINUS
     %%
     statement_list: statement '\n'
         | statement_list statement '\n'
         ;
     statement: NAME '=' expression { $symtab{$_[1]} = $_[3]; }
         | expression { print "= ", $_[1], "\n"; }
         ;
     expression:
           expression '+' expression { $_[1] + $_[3] }
         | expression '-' expression { $_[1] - $_[3] }
         | expression '*' expression { $_[1] * $_[3] }
         | expression '/' expression
                       {     if ($_[3] == 0)
                                { $_[0]->YYError("divide by zero") }
                             else
                                { $_[1] / $_[3] }
                       }
         | '-' expression %prec UMINUS { -$_[2] }
         | '(' expression ')' { $_[2] }
         | NUMBER
         | NAME { $symtab{$_[1]} }
         ;
     %%

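As you can see, we've declared a hash %symtab to hold the values of the names. Also, notice that the yacc variables $1, $2, etc. become real subroutine parameters in the @_ array: $_[1], $_[2], and so on. $_[0] holds the parser object itself, which is why the division rule can call $_[0]->YYError, and the value an action returns takes the place of yacc's $$.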
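To make the calling convention concrete, here is a minimal sketch in plain Perl, with no Parse::Yapp involved, of how an action such as the one for expression '+' expression sees its arguments. The names $fake_parser and $add_action are hypothetical stand-ins for what the generated parser does internally; the sketch assumes only that the parser object arrives as $_[0] and the right-hand-side values follow it in rule order.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parser object that arrives as $_[0].
my $fake_parser = {};

# The action body { $_[1] + $_[3] } behaves like an anonymous subroutine;
# whatever it returns becomes the value of the rule (yacc's $$).
my $add_action = sub { $_[1] + $_[3] };

# For "expression '+' expression" the values arrive in rule order:
# $_[1] is the left operand, $_[2] the '+' token, $_[3] the right operand.
my $result = $add_action->($fake_parser, 25, '+', 30);
print "$result\n";
```

Here $_[2] is never used, which mirrors the grammar: the operator token only selects the rule, and the arithmetic works on the two operand values.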

Next we need to produce a lexer that feeds tokens to the parser. Parse::Yapp expects a subroutine to take input from the data store of the parser object. The Parse::Yapp object is passed in as the first parameter to the lexer, and so the data store ends up looking like $_[0]->YYData->{DATA}.^[*] The lexing subroutine should modify this data store to remove the current token, and then return a two-element list.

^[*] There's nothing in Parse::Yapp that says the data has to live in {DATA}, but it's a good practice. If you have extremely complex data as input, you may want to use several different parts of $_[0]->YYData.

The list should consist of the token type followed by the token data. For instance, in our calculator example, we need to tokenize 12345 as ("NUMBER", 12345). Operators, parentheses, the equals sign, and the newline should be returned as themselves, and names of variables need to be returned as ("NAME", "whatever"). At the end of the input, we need to return an empty string and undef: ('', undef).

Here's a reasonably simple Perl routine that does all of that:

     sub lex {
     #    print " Lexer called to handle (".$_[0]->YYData->{DATA}.")\n";
         $_[0]->YYData->{DATA} =~ s/^ +//;
         return ('', undef) unless length $_[0]->YYData->{DATA};
         $_[0]->YYData->{DATA} =~ s/^(\d+)// and return ("NUMBER", $1);
         $_[0]->YYData->{DATA} =~ s/^([\n=+\(\)\-\/*])//    and return ($1, $1);
         $_[0]->YYData->{DATA} =~ s/^(\w+)//    and return ("NAME", $1);
         die "Unknown token (".$_[0]->YYData->{DATA}.")\n";
     }
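One way to check the lexer before generating the parser is to call it by hand. The sketch below assumes a hypothetical FakeParser class that imitates only the YYData method of a real Parse::Yapp parser, and it repeats the lex routine so that the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a Parse::Yapp parser: it provides only YYData,
# which returns the same hash reference on every call.
package FakeParser;
sub new    { return bless { data => {} }, shift }
sub YYData { return $_[0]{data} }

package main;

# The lexer from the text, repeated here so the example is self-contained.
sub lex {
    $_[0]->YYData->{DATA} =~ s/^ +//;
    return ('', undef) unless length $_[0]->YYData->{DATA};
    $_[0]->YYData->{DATA} =~ s/^(\d+)//             and return ("NUMBER", $1);
    $_[0]->YYData->{DATA} =~ s/^([\n=+\(\)\-\/*])// and return ($1, $1);
    $_[0]->YYData->{DATA} =~ s/^(\w+)//             and return ("NAME", $1);
    die "Unknown token (" . $_[0]->YYData->{DATA} . ")\n";
}

my $p = FakeParser->new;
$p->YYData->{DATA} = "a = 25\n";

# Call the lexer until it signals end of input with an empty token type.
my @tokens;
while (1) {
    my ($type, $value) = lex($p);
    last if $type eq '';
    push @tokens, [$type, $value];
}

# Print each pair, showing newlines as \n so they stay visible.
for my $t (@tokens) {
    (my $type  = $t->[0]) =~ s/\n/\\n/;
    (my $value = $t->[1]) =~ s/\n/\\n/;
    print "($type, $value)\n";
}
```

For the input "a = 25\n" this yields four pairs: a NAME, the equals sign, a NUMBER, and the newline, each operator-like token being returned as itself.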

Now that we have our grammar and our lexer, we need to run the grammar through the command-line utility yapp to turn it into a usable Perl module. If all is well, this should be a silent process:

     % yapp Calc.yapp
     %

and we should have a new file Calc.pm ready for use.

Parse::Lex

Our lexer in this case is pretty simple, so we could code it up in a fairly straightforward subroutine. However, for more difficult lexing operations, it might make sense to use a dedicated lexing language; Parse::Lex is to lex what Parse::Yapp is to yacc. Here's our lexer (crudely) rewritten for Parse::Lex.

     use Parse::Lex;
     my @lex = qw(
         NUMBER \d+
         NAME \w+
         "+" "+"
         "-" "-"
         "*" "*"
         "/" "/"
         "=" "="
         "(" "("
         ")" ")"
     );
     my $lexer = Parse::Lex->new(@lex);
     $lexer->from(\*STDIN);
     sub lex {
         my $token = $lexer->next;
         return ('', undef) if $lexer->eoi;
         return ($token->name, $token->getstring);
     }

We can now put it all together: our parser, the lexer, and some code to drive them.

     sub lex {
         $_[0]->YYData->{DATA} =~ s/^ +//;
         return ('', undef) unless length $_[0]->YYData->{DATA};
         $_[0]->YYData->{DATA} =~ s/^(\d+)// and return ("NUMBER", $1);
         $_[0]->YYData->{DATA} =~ s/^([\n=+\(\)\-\/*])//    and return ($1, $1);
         $_[0]->YYData->{DATA} =~ s/^(\w+)//    and return ("NAME", $1);
         die "Unknown token (".$_[0]->YYData->{DATA}.")\n";
     }
     use Calc;
     my $p = Calc->new();
     undef $/;
     $p->YYData->{DATA} = <STDIN>;
     $p->YYParse(YYlex => \&lex);

This will take a stream of commands on standard input, run the calculations, and print them out, like this:

     % perl calc
     a = 2+4
     b = a * 20
     b + 15
     ^D
     = 135

For most parsing applications, this is all we need. However, in the case of a calculator, you hardly want to put all the calculations in first and get all the answers out at the end. It needs to be more interactive.

What we need to do is modify the lexer so that it can take data from standard input, using the YYData area as a buffer.

     sub lex {
         $_[0]->YYData->{DATA} =~ s/^ +//;
         unless (length $_[0]->YYData->{DATA}) {
             return ('', undef) if eof STDIN;
             $_[0]->YYData->{DATA} = <STDIN>;
             $_[0]->YYData->{DATA} =~ s/^ +//;
         }
         $_[0]->YYData->{DATA} =~ s/^(\d+)// and return ("NUMBER", $1);
         $_[0]->YYData->{DATA} =~ s/^([\n=+\(\)\-\/*])//    and return ($1, $1);
         $_[0]->YYData->{DATA} =~ s/^(\w+)//    and return ("NAME", $1);
         die "Unknown token (".$_[0]->YYData->{DATA}.")\n";
     }

This time, we check to see if the buffer's empty, and instead of giving up, we get another line from standard input. If we can't read from that, then we give up. Now we can intersperse results with commands, giving a much more calculator-like feel to the application.
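The refill logic can be tried without a terminal. The following sketch makes a few assumptions that are not in the original: it reads from an in-memory filehandle $in instead of STDIN, keeps the buffer in a plain hash %state rather than in YYData, and uses a hypothetical next_token routine. It also leaves $/ at its default so that each read returns a single line.

```perl
use strict;
use warnings;

# Assumed input: three lines held in a string and opened as a filehandle.
my $source = "a = 2+4\nb = a * 20\nb + 15\n";
open my $in, '<', \$source or die "Cannot open in-memory handle: $!";

my %state = (DATA => '');   # plays the role of $_[0]->YYData
my $reads = 0;              # counts how many lines were fetched

sub next_token {
    $state{DATA} =~ s/^ +//;
    unless (length $state{DATA}) {
        return ('', undef) if eof $in;   # nothing left to read: give up
        $state{DATA} = <$in>;            # refill the buffer with one line
        $reads++;
        $state{DATA} =~ s/^ +//;
    }
    $state{DATA} =~ s/^(\d+)//             and return ('NUMBER', $1);
    $state{DATA} =~ s/^([\n=+\(\)\-\/*])// and return ($1, $1);
    $state{DATA} =~ s/^(\w+)//             and return ('NAME', $1);
    die "Unknown token ($state{DATA})\n";
}

# Drain the input, counting tokens until the end-of-input marker appears.
my $count = 0;
while (1) {
    my ($type) = next_token();
    last if $type eq '';
    $count++;
}
print "$count tokens from $reads reads\n";
```

The buffer is refilled only when it runs dry, so the three input lines cost three reads, and a parser driven this way can act on each line's tokens before the next line has been typed.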