11.2 Regular Expressions

Regular expressions can be thought of as programs whose job it is to see if a string of text fits a pattern. These programs are often executed by translating them into a finite state machine, a kind of virtual computer with a very limited instruction set and no memory. Here is an example of a regular expression that matches floating-point numbers in Java:

` [0-9]*.([0-9]*)?([eE][-+]?[0-9]+)?[fFdD]? `

Figure 11.1 shows how you interpret this expression. Here is how a regular expression matches the string of characters 3.14. The 3 matches [0-9]* in the first box. The period matches the beginning of the second box. The digits 1 and 4 match the rest of the second box. The third and fourth boxes are optional (indicated by the ? at the end). Similarly, a more exotic number like 6.022e+23d also matches the regular expression.

Figure 11.1. Reading a regular expression

Something like 2BorNot2B does not match. Although the 2 matches with the first box, the letter B does not match with anything.

An easy way to find out whether or not a regular expression matches a string of input text is to use a finite state machine. Figure 11.2 shows a finite state machine for the example regular expression. Each circle represents a state. The labeled arrows between states are called transitions. The double circles are special states called terminal states.

Figure 11.2. Finite state machine recognizing floating-point numbers

To use a finite state machine, you start at the state labeled start. Then you look at the first character in the input. If you see an outbound arrow from start labeled with that character, then follow the arrow to the new state. Repeat with the next character, using the new state, until you reach the end of the input. If you end up in a terminal state, then the regular expression matches the input string. If the final state isn't a terminal state, or if you find yourself in a state where there is no outbound arrow labeled with the next input character, then the regular expression does not match the input string.

To recognize the number 6.022e+23, you would follow the diagram like this:

State Next character New state
Start 6 1
1 . 2
2 0 2
2 2 2
2 2 2
2 e 3
3 + 4
4 2 5
5 3 5

Since 5 is a terminal state, the number 6.022e+23 is a valid floating-point number. By contrast, 2.eD is not:

State Next character New state
Start 2 1
1 . 2
2 e 3
3 D none

Since there is no transition from state 3 labeled D, the string 2.eD is not a valid floating-point number.

To write a program to match regular expressions to input strings, programs often incorporate virtual machines to simulate the finite state machines. As a JVM programmer, you've got a portable virtual machine always accessible, without having to create one of your own. This is the plan: each state is a location in the program. Transitions are gotos from one state to another. At each state, read the next character and do a goto to the next state based on the character.

Here is the code to match the finite state machine shown:

` .method static matchFloatingPoint(Ljava/lang/String;)Z .var 0 is string  Ljava/lang/String;                           ; String to match .var 1 is index I         ; Current index .var 2 is length I        ; Length of the string iconst_m1 istore_1                  ; Initialize index aload_0 invokevirtual java/lang/String/length()I istore_2                  ; Initialize length start: jsr next_char_non_accept lookupswitch '.': state2 '0': state1 '1': state1 '2': state1 '3': state1 '4': state1 '5': state1 '6': state1 '7': state1 '8': state1 '9': state1 default: fail state1: jsr next_char_non_accept lookupswitch 'e': state3 'E': state3 'f': state7 'F': state7 'd': state7 'D': state7 '.': state2 '0': state1 '1': state1 '2': state1 '3': state1 '4': state1 '5': state1 '6': state1 '7': state1 '8': state1 '9': state1 default: fail state2: jsr next_char_accept lookupswitch 'e': state3 'E': state3 'f': state6 'F': state6 'd': state6 'D': state6 '0': state2 '1': state2 '2': state2 '3': state2 '4': state2 '5': state2 '6': state2 '7': state2 '8': state2 '9': state2 default: fail state3: jsr next_char_non_accept lookupswitch '+': state3 '-': state4 '0': state5 '1': state5 '2': state5 '3': state5 '4': state5 '5': state5 '6': state5 '7': state5 '8': state5 '9': state5 default: fail state4: jsr next_char_non_accept lookupswitch '0': state5 '1': state5 '2': state5 '3': state5 '4': state5 '5': state5 '6': state5 '7': state5 '8': state5 '9': state5 default: fail state5: jsr next_char_non_accept lookupswitch 'f': state6 'F': state6 'd': state6 'D': state6 '0': state5 '1': state5 '2': state5 '3': state5 '4': state5 '5': state5 '6': state5 '7': state5 '8': state5 '9': state5 default: fail state6: ; No more input is expected, and a float has been recognized goto succeed `

The code at next_char_accept and next_char_non_accept are subroutines. Each retrieves the next character from the input and places it on the stack. If there is no next character, next_char_accept goes to succeed, and next_char_non_accept goes to fail. The subroutine next_char_accept is used in the terminal states, where the end of the input means that the input is acceptable. The subroutine next_char_non_accept is used in nonterminal states, where more input is expected. The code for these subroutines is

` ; Go to the next character, and fail if it's EOF next_char_non_accept: astore_3              ; Store the return address iinc 1 1              ; Increment index iload_1 iload_2 if_icmpge fail        ; If index is past the end, fail aload_0               ; Push character iload_1 invokevirtual java/lang/String/charAt (I)C ret 3 ; Go to the next character, and succeed if it's EOF next_char_accept: astore_3              ; Store the return address iinc 1 1              ; Increment index iload_1 iload_2 if_icmpge succeed     ; If index is past the end, succeed aload_0               ; Push character iload_1 invokevirtual java/lang/String/charAt (I)C ret 3 succeed:    ; The input string matches the regular expression    iconst_1    ireturn            ; Return true fail:    ; The input string does not match the regular expression    iconst_0    ireturn            ; Return false .end method `

The method takes a String as its input and returns true if it is a valid floating-point number, false otherwise.

Each state begins with a call to a subroutine, either next_char_accept or next_char_non_accept. If the current state is a terminal state, then it calls next_char_accept; otherwise, it calls next_char_non_accept.

The difference between next_char_accept and next_char_non_accept is what happens when you run out of input. Subroutine next_char_accept is called from a terminal state, which means that the input is acceptable if there are no more characters, so the method returns true. Subroutine next_char_non_accept is called from a nonterminal state, which means that the input is not acceptable, and the method returns false.

Both subroutines read the next character in the input and leave it on the stack. This character drives the lookupswitch instruction, which uses it to determine which state to go to next. The code for the lookupswitch follows directly from the diagram in Figure 11.2. For each transition, there is an entry in the lookupswitch that sends the finite state machine to that state. If the character isn't recognized, then go to fail.

As long as there is more input, the method keeps jumping from location to location. When there is no more input, the method goes to either succeed or fail, depending on whether the last state was a terminal state or not. This action returns true or false, whether the input was acceptable or not.

Programming for the Javaв„ў Virtual Machine
ISBN: 0201309726
EAN: 2147483647
Year: 1998
Pages: 158
Authors: Joshua Engel