Recipe 10.7. Implementing a Better Tokenizer

Problem

A simple method of tokenizing (breaking up a string into its discrete elements) was presented in Recipe 2.6. However, this is not powerful enough to handle all your string-tokenizing needs. You need a tokenizer, also referred to as a lexer, that can split up a string based on a well-defined set of characters.

Solution

Using the Split method of the Regex class, you can use a regular expression to indicate the types of tokens and separators that you are interested in gathering. This technique works especially well with equations, since the tokens of an equation are well defined. For example, the code:

    using System;
    using System.Text.RegularExpressions;

    public static string[] Tokenize(string equation)
    {
        Regex RE = new Regex(@"([\+\-\*\(\)\^\\])");
        return (RE.Split(equation));
    }

will divide up a string according to the regular expression specified in the Regex constructor. In other words, the string passed in to the Tokenize method will be divided up based on the delimiters +, -, *, (, ), ^, and \. The following method will call the Tokenize method to tokenize the equation (y - 3)(3111*x^21 + x + 320):

    public void TestTokenize()
    {
        foreach (string token in Tokenize("(y - 3)(3111*x^21 + x + 320)"))
            Console.WriteLine("String token = " + token.Trim());
    }

which displays the following output:

    String token =
    String token = (
    String token = y
    String token = -
    String token = 3
    String token = )
    String token =
    String token = (
    String token = 3111
    String token = *
    String token = x
    String token = ^
    String token = 21
    String token = +
    String token = x
    String token = +
    String token = 320
    String token = )
    String token =

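Notice that each individual operator, parenthesis, and number has been broken out into its own separate token. The empty tokens in the output appear wherever a delimiter begins or ends the string, or where two delimiters are adjacent, as with the ")(" in the middle of the equation.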
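If the empty entries are not wanted, they can be filtered out after the split. The following is a minimal sketch rather than part of the original recipe: the class name TokenizerSketch and the method TokenizeNonEmpty are hypothetical, and the code assumes LINQ (System.Linq) is available.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenizerSketch
{
    // Hypothetical helper (not in the recipe): uses the same pattern as
    // Tokenize, then trims each piece and drops the ones that are empty.
    public static string[] TokenizeNonEmpty(string equation)
    {
        Regex re = new Regex(@"([\+\-\*\(\)\^\\])");
        return re.Split(equation)
                 .Select(token => token.Trim())
                 .Where(token => token.Length > 0)
                 .ToArray();
    }

    public static void Main()
    {
        foreach (string token in TokenizeNonEmpty("(y - 3)(3111*x^21 + x + 320)"))
            Console.WriteLine("String token = " + token);
    }
}
```

For the sample equation, this yields the same tokens as before minus the three empty ones, leaving 16 tokens that begin with ( and end with ).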

Discussion

The tokenizer created in Recipe 2.6 would be useful in specific controlled circumstances. However, in real-world projects, you do not always have the luxury of being able to control the set of inputs to your code. By making use of regular expressions, you can take the original tokenizer and make it flexible enough to allow it to be applied to any type or style of input you desire.

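The key method used here is the Split instance method of the Regex class. The return value of this method is a string array with elements that include each individual token of the source string (the equation, in this case). Because the character class in the pattern is wrapped in capturing parentheses, Split also returns the delimiters themselves as elements of the array; without the parentheses, the delimiters would be discarded.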
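The effect of the capturing parentheses on Split can be seen by running the same character class with and without them. This is an illustrative sketch only: the class and method names are hypothetical, and the pattern is shortened to two delimiters to keep the output small.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureGroupSketch
{
    // Capturing parentheses around the character class:
    // the matched delimiters are kept in the result array.
    public static string[] SplitKeepingDelimiters(string input)
    {
        return new Regex(@"([\+\^])").Split(input);
    }

    // No capturing parentheses: the matched delimiters are discarded.
    public static string[] SplitDroppingDelimiters(string input)
    {
        return new Regex(@"[\+\^]").Split(input);
    }

    public static void Main()
    {
        string input = "x^21+x";
        Console.WriteLine(string.Join(" | ", SplitKeepingDelimiters(input)));  // x | ^ | 21 | + | x
        Console.WriteLine(string.Join(" | ", SplitDroppingDelimiters(input))); // x | 21 | x
    }
}
```

For a tokenizer, the first form is usually the one you want, since the operators are tokens in their own right.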

Notice that the static Split overloads accept RegexOptions enumeration values, while the instance overloads accept a maximum number of substrings to return and a starting position for the search. (With the instance method, any options are supplied to the Regex constructor instead.) This may have some bearing on whether you choose the static or instance method.
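The difference between the static and instance overloads can be sketched as follows. The class name, method names, and sample inputs are hypothetical and chosen only for illustration; the patterns deliberately omit capturing parentheses so that the count limit is easy to see.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitOverloadsSketch
{
    // Static overload: RegexOptions are passed along with the call.
    public static string[] SplitIgnoringCase(string input)
    {
        return Regex.Split(input, "x", RegexOptions.IgnoreCase);
    }

    // Instance overload: count limits how many substrings are returned;
    // the last one holds the unsplit remainder of the input.
    public static string[] SplitWithCount(string input, int count)
    {
        Regex re = new Regex(@"[\+\-]");
        return re.Split(input, count);
    }

    // Instance overload: the search for delimiters starts at startAt;
    // text before that position stays in the first element.
    public static string[] SplitFrom(string input, int count, int startAt)
    {
        Regex re = new Regex(@"[\+\-]");
        return re.Split(input, count, startAt);
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" | ", SplitIgnoringCase("aXbxc")));   // a | b | c
        Console.WriteLine(string.Join(" | ", SplitWithCount("1+2-3+4", 2))); // 1 | 2-3+4
        Console.WriteLine(string.Join(" | ", SplitFrom("1+2-3+4", 2, 2)));
    }
}
```

If you need both options and a count or starting position, one approach is to pass the options to the Regex constructor and then call the instance overload.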