Let s now continue with another example of parser construction with flex and bison , but in this case, we ll use the data that s parsed. In this example, we ll create a parser for an e-mail firewall configuration that defines from whom we ll accept mail and also from whom we ll reject e-mail. In this example, mail that that s not specified as allowed but also not explicitly specified as disallowed will simply be quarantined for later review by the e-mail recipient. The goal of such an e-mail firewall is to quickly identify e-mail that we expect, disallow e-mail that we don t want, and then cache the rest for later review.
Our configuration file is illustrated in Listing 23.6. We have two sections that define those e-mail addresses that we ll allow (those that we ll immediately allow to make it to the recipient) and also those that we ll disallow (reject or ignore).
1 : allow { 2 : 3 : mtj@mtjones.com 4 : dan_5422@yahoo.com 5 : albert@camus.com 6 : 7 : } 8 : 9 : disallow { 10 : 11 : spammer@spamcorp.com 12 : viagera@pharma.com 13 : 14 : }
The structure of this file ( Listing 23.6) is very simple. Those e-mail addresses in the allow section are permitted, while those in the disallow section are rejected.
Our parser will be constructed in the two phases that we ve demonstrated in our first example. Our configuration file lexer will be defined in a file called config.l , and our grammar parser in config.y . Rather than taking our configuration file from stdin , in this example we ll allow the configuration file to be read from a file.
The lexer is the first phase of our configuration file parser. It will break down the file into its basic elements and return them to the grammar parser. The tokens that we ll permit, from Listing 23.6, are minimal and consist of two reserved words ( allow and disallow ), section delimiters ( { and } ), and finally the e-mail addresses, which will consist of an aggregate of tokens. Whitespace and newlines will simply be ignored. Our flex input file for the configuration file lexer is shown in Listing 23.7.
1 : %{ 2 : #include <stdio.h> 3 : 4 : #include "config.tab.h" 5 : %} 6 : 7 : %% 8 : allow return ALLOW; 9 : disallow return DISALLOW; 10 : [a-zA-Z]+[a-zA-Z0-9_]* yylval=strdup(yytext); return WORD; 11 : \{ return OPEN_BRACE; 12 : \} return CLOSE_BRACE; 13 : \@ return ATSYM; 14 : \. return PERIODSYM; 15 : \n /* Ignore end-of-line */ 16 : [ \t]+ /* Ignore whitespace */ 17 : %%
The first item to note is that we ve specified that we ll need the parser generator header file to understand what tokens will be communicated (line 4). Next, at line 10, we see a new action for a string pattern. This particular pattern recognizes strings that begin with a letter (uppercase or lowercase) and then follow with optional characters of letters , numbers , or the _ character. When we recognize one of these, its value (what was tokenized) is copied into a special variable called yylval . When the lexer recognizes this token, it s stored in yytext , which allows the parser to see the actual token (in addition to the token type that s returned, in this case WORD ).
Now we have a lexer that will tokenize our file and also return the special words that it finds (ignoring the reserved words, given their precedence in the table). Let s now explore the new parser ( Listing 23.8).
1 : %{ 2 : #include <stdio.h> 3 : #include <string.h> 4 : 5 : void yyerror( const char *str ) 6 : { 7 : fprintf( stderr, "error: %s\n", str ); 8 : } 9 : 10 : 11 : int main() 12 : { 13 : FILE *infp; 14 : 15 : infp = fopen( "config.file", "r" ); 16 : 17 : yyrestart( infp ); 18 : 19 : yyparse(); 20 : 21 : fclose( infp ); 22 : 23 : return 0; 24 : } 25 : 26 : 27 : char address[10][80]; 28 : int addrCount = 0; 29 : 30 : %} 31 : 32 : %token ALLOW OPEN_BRACE CLOSE_BRACE DISALLOW WORD ATSYM PERIODSYM 33 : 34 : %% 35 : 36 : configs: 37 : configs config 38 : ; 39 : 40 : config: 41 : allowed 42 : 43 : disallowed 44 : ; 45 : 46 : allowed: ALLOW OPEN_BRACE targets CLOSE_BRACE 47 : { 48 : int i; 49 : printf("Allow these addresses:\n"); 50 : for ( i = 0 ; i < addrCount ; i++ ) { 51 : printf( "\t%s\n", address[i] ); 52 : } 53 : addrCount = 0; 54 : } 55 : ; 56 : 57 : disallowed: DISALLOW OPEN_BRACE targets CLOSE_BRACE 58 : { 59 : int i; 60 : printf("Disallow these addresses:\n"); 61 : for ( i = 0 ; i < addrCount ; i++ ) { 62 : printf( "\t%s\n", address[i] ); 63 : } 64 : addrCount = 0; 65 : } 66 : ; 67 : 68 : 69 : targets: 70 : 71 : targets email_address 72 : ; 73 : 74 : 75 : email_address: 76 : WORD ATSYM WORD PERIODSYM WORD 77 : { 78 : if ( addrCount < 10 ) { 79 : sprintf( address[addrCount++], 80 : "%s@%s.%s", , , ); 81 : free(); free(); free(); 82 : } 83 : } 84 : ;
In the bison declaration section (lines 31 “33) we define the tokens that are expected from the lexer. The result of this line (when the parser generator is created) is a header file containing symbolic constants for each of these tokens.
The rules section (lines 34 “84) defines the grammar for the configuration parser. We include a base rule to allow one or more configuration sections (at lines 36 “38) and then a rule to match either an allow section or a disallow section (lines 40 “44). The allowed and disallowed rules are similar in pattern, except for what they represent (permitted or rejected e-mail addresses). Each begins with the respective reserved word ( allow or disallow and then an open brace ( { ) followed by zero or more e-mail addresses, terminated by a close brace ( } ). We see C code sections for each of the two rules; we ll return to these after we dive into the targets rule.
In the targets rule, we specify that zero or more e-mail addresses can be recognized (lines 69 “72). Our email_address rule (lines 75 “84) specifies a simple e-mail address recognizer. In this case, that is a word followed by a @ symbol, followed by another two words, separated by a ˜. Therefore, e-mail addresses such as:
mtj@mtjones.com
will be recognized, but:
mtj@mail.mtjones.com
will not. For this demonstration, the simple e-mail address spec will suffice.
The action portion of the email_address rule defines what we do when we recognize a valid e-mail address. In this case (as long as we haven t exhausted our storage resources), we populate an address entry with the aggregated e-mail address. Note here that the token list:
WORD ATSYM WORD PERIODSYM WORD
is accessible through a special set of references. The reference $1 represents the first WORD in the list. Recall in our lexer ( Listing 23.7 line 10) that we duplicated the current token using strdup to yylval . The $1 here represents the yylval stored by the lexer. Reference $3 represents the third WORD and $5 the last WORD . Therefore, we use sprintf (to concatenate the strings together into our address array), resulting in a contiguous e-mail address from the individual WORD parts . Once we ve stored the address recognized, we free each of the string references. Recall that we used strdup to copy the current token. This has the effect of malloc ing memory for the new object, and therefore, we must free this resource.
Once the CLOSE_BRACE is recognized for the section (line 46 or 57), we emit the addresses that were found. We know the type of addresses these are ( allowed or disallowed ), given the context of the rule. If we re in the allowed rule, then we know that these addresses are within an allow section. Similarly, within the disallowed rule, we emit the addresses as disallowed. After we emit each section, we zero the addrCount to permit additional sections to be recognized and stored.
Let s look at an example of the parser. Listing 23.9 provides an example file (properly formed ).
1 : allow { 2 : 3 : mtj@mtjones.com 4 : dan_5422@yahoo.com 5 : 6 : } 7 : 8 : disallow { 9 : 10 : you@yahoo.com 11 : them@excite.com 12 : 13 : }
Calling this file config.file (as required by the parser), we demonstrate its use as follows :
# ./parser Allow these addresses: mtj@mtjones.com dan_4522@yahoo.com Disallow these addresses: you@yahoo.com them@excite.com #
If any sections were not properly formed, we d not see any addresses (since the addresses aren t actually printf d until the entire section is parsed).
While this was a simple example, it should illustrate some of the power of flex and bison . flex and bison could be used to build lexers and parsers for very complex languages (such as the C language) or very simple grammars as illustrated here.