Algorithms for Compiler Design (Electrical and Computer Engineering Series)

2.10 LEXICAL ANALYZER DESIGN

Since the function of the lexical analyzer is to scan the source program and produce a stream of tokens as output, the issues involved in the design of a lexical analyzer are:

Identifying the tokens of the language for which the lexical analyzer is to be built, and specifying these tokens by using a suitable notation, and
Constructing a suitable recognizer for these tokens.

Therefore, the first requirement is to identify the keywords, the operators, and the delimiters. These are the tokens of the language. After identifying the tokens, we must use a suitable notation to specify them. This notation should be compact, precise, and easy to understand. Regular expressions can be used to specify a set of strings, and a set of strings that can be specified by using regular-expression notation is called a "regular set." The tokens of a programming language constitute a regular set. Hence, this regular set can be specified by using regular-expression notation. Therefore, we write regular expressions for things like operators, keywords, and identifiers. For example, the regular expressions specifying a subset of the tokens of a typical programming language are as follows:

    operators  = + | - | * | / | mod | div
    keywords   = if | while | do | then
    letter     = a | b | c | d | ... | z | A | B | C | ... | Z
    digit      = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    identifier = letter (letter | digit)*

The advantage of using regular-expression notation for specifying tokens is that the recognizer for the tokens ends up being a DFA. Therefore, the next step is the construction of a DFA from the regular expressions that specify the tokens of the language. But the DFA is only a flow-chart (graphical) representation of the lexical analyzer. Therefore, after constructing the DFA, the next step is to write a program in a suitable programming language that simulates the DFA. This program acts as the token recognizer, or lexical analyzer. We find, then, that by using regular expressions to specify the tokens, designing a lexical analyzer becomes a simple mechanical process: transform the regular expressions into finite automata, and generate a program that simulates the finite automata.
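As an illustration of the last step, here is a minimal sketch in C of a program that simulates a DFA for the identifier expression, letter (letter | digit)*. The state names, the `step` function, and `is_identifier` are hypothetical names chosen for this sketch, and the standard `isalpha`/`isalnum` classifiers stand in for the letter and digit classes (in the default "C" locale they accept exactly a-z, A-Z, and 0-9).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* States of a hypothetical DFA for: identifier = letter (letter | digit)* */
enum state { START, IN_ID, DEAD };

/* Transition function: next state for the given state and input character. */
static enum state step(enum state s, unsigned char c)
{
    switch (s) {
    case START: return isalpha(c) ? IN_ID : DEAD;
    case IN_ID: return isalnum(c) ? IN_ID : DEAD;
    default:    return DEAD;   /* the dead state has no way out */
    }
}

/* Returns 1 if the whole string s is accepted by the DFA, 0 otherwise. */
int is_identifier(const char *s)
{
    assert(s != NULL);
    enum state st = START;
    for (; *s != '\0' && st != DEAD; s++)
        st = step(st, (unsigned char)*s);
    return st == IN_ID;        /* IN_ID is the only accepting state */
}
```

With these definitions, `is_identifier("count1")` is accepted, while `is_identifier("1abc")` and the empty string are rejected. A generated analyzer would typically drive the same loop from a transition table rather than a `switch`, but the control structure is the same.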

Therefore, it is possible to automate the procedure of obtaining the lexical analyzer from the regular expressions that specify the tokens, and this is precisely what the tool LEX is used to do. LEX is a compiler-writing tool that facilitates writing the lexical analyzer, and hence a compiler. It takes as input the regular expressions that specify the tokens to be recognized and generates a C program as output that acts as a lexical analyzer for those tokens.

2.10.1 Format of the Input or Source File of LEX

The LEX source file contains two things:

Auxiliary definitions having the format: name = regular expression.
The purpose of the auxiliary definitions is to give suitable names to regular expressions, so that larger expressions can refer to them by name.
LEX uses the auxiliary definitions to replace each name that appears in a pattern with its corresponding regular expression.
The translation rules having the format:
- pattern {action}.

The "pattern" specification is a regular expression that specifies the tokens, and "{action}" is a program fragment written in C that specifies the action to be taken by the lexical analyzer generated by LEX when it encounters a string matching the pattern. Normally, the action taken by the lexical analyzer is to return a pair to the parser or syntax analyzer. The first member of the pair is a token, and the second member is the value or attribute of the token. For example, if the token is an identifier, then the value of the token is a pointer to the symbol-table record that contains the name of the identifier. Hence, the action taken by the lexical analyzer is to install the name in the symbol table, return the token as an id, and set the value of the token to a pointer to the symbol-table record where the name is installed. Consider the following sample LEX source file:

    letter                       [a-zA-Z]
    digit                        [0-9]
    %%
    begin                        { return ("BEGIN"); }
    end                          { return ("END"); }
    if                           { return ("IF"); }
    letter (letter|digit)*       { install ( );
                                   return ("identifier");
                                 }
    <                            { return ("LT"); }
    <=                           { return ("LE"); }
    %%
    definition of install()
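The specification leaves install() undefined. The following is a minimal sketch in C of what such a routine might do, together with the (token, attribute) pair described earlier. Everything here is an assumption made for illustration: the fixed-size array used as the symbol table, the `struct symbol` and `struct token` layouts, the `make_id_token` helper, and the fact that `install` takes the name as a parameter (in the specification it takes no arguments and would read the matched text from the analyzer's own buffer).

```c
#include <assert.h>
#include <string.h>

#define MAX_SYMBOLS 64
#define MAX_NAME    32

/* Hypothetical symbol-table record; a real one would also hold type, scope, etc. */
struct symbol { char name[MAX_NAME]; };

/* The (token, attribute) pair handed to the parser. */
enum token_kind { TOK_ID };
struct token { enum token_kind kind; struct symbol *attr; };

static struct symbol table[MAX_SYMBOLS];
static int n_symbols = 0;

/* Look the name up and add it if absent. Returns a pointer to its record,
   or NULL if the table is full or the name is too long. */
struct symbol *install(const char *name)
{
    assert(name != NULL);
    for (int i = 0; i < n_symbols; i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];              /* already installed */
    if (n_symbols == MAX_SYMBOLS || strlen(name) >= MAX_NAME)
        return NULL;
    strcpy(table[n_symbols].name, name);   /* length checked above */
    return &table[n_symbols++];
}

/* Build the pair returned for an identifier lexeme. */
struct token make_id_token(const char *lexeme)
{
    struct token t = { TOK_ID, install(lexeme) };
    return t;
}
```

Installing the same name twice yields the same record, so every occurrence of an identifier carries a pointer to one shared symbol-table entry. A linear search keeps the sketch short; a production symbol table would more likely use hashing.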

In the LEX specification above, the string "begin" matches two patterns: one specifying the keyword and the other specifying identifiers. In this case, the match is made against whichever pattern comes first in the physical order of the specification. Hence, "begin" will be recognized as a keyword and not as an identifier. Therefore, patterns that specify the keywords of the language must be listed before the pattern that specifies identifiers; otherwise, every keyword will be recognized as an identifier. A lexical analyzer generated by LEX always tries to recognize the longest prefix of the input as a token. Hence, if <= is read, it will be recognized as the token "LE," not "LT."
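The two disambiguation rules (longest prefix first, then earliest pattern on a tie) can be sketched in C as follows. This is a hand-written illustration, not the code LEX generates: the `rules` table, the `match_lit` and `match_id` helpers, and `next_token` are hypothetical names, and each pattern is tried in turn rather than being compiled into a single DFA.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Length of the prefix of s matched by a literal pattern, or 0 if none. */
static size_t match_lit(const char *s, const char *lit)
{
    size_t n = strlen(lit);
    return strncmp(s, lit, n) == 0 ? n : 0;
}

/* Length of the longest prefix of s matching letter (letter | digit)*. */
static size_t match_id(const char *s)
{
    size_t n = 0;
    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[n]))
        n++;
    return n;
}

/* Rules in the same physical order as the specification.
   A NULL literal marks the identifier pattern. */
struct rule { const char *token; const char *lit; };
static const struct rule rules[] = {
    { "BEGIN", "begin" }, { "END", "end" }, { "IF", "if" },
    { "identifier", NULL },
    { "LT", "<" }, { "LE", "<=" },
};

/* Returns the token name for the longest prefix of s and stores the
   prefix length in *len; returns NULL (and *len == 0) if no rule matches. */
const char *next_token(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);
    const char *best = NULL;
    *len = 0;
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i].lit ? match_lit(s, rules[i].lit) : match_id(s);
        if (n > *len) {        /* strict '>' keeps the earlier rule on a tie */
            *len = n;
            best = rules[i].token;
        }
    }
    return best;
}
```

On the input "begin" both the keyword rule and the identifier rule match five characters, and the earlier keyword rule wins. On "beginner" the identifier rule matches eight characters and therefore takes precedence over the keyword. On "<=" the two-character match selects LE over LT.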