Recipe 2.6 The Poor Man s Tokenizer

Recipe 2.6 The Poor Man's Tokenizer

Problem

You need a quick method of breaking up a string into a series of discrete tokens or words.

Solution

Use the Split instance method of the string class. For example:

 string equation = "1 + 2 - 4 * 5"; string[] equationTokens = equation.Split(new char[1]{' '}); foreach (string Tok in equationTokens)    Console.WriteLine(Tok); 

This code produces the following output:

 1 + 2 - 4 * 5 

The Split method may also be used to separate people's first, middle, and last names . For example:

 string fullName1 = "John Doe"; string fullName2 = "Doe,John"; string fullName3 = "John Q. Doe"; string[] nameTokens1 = fullName1.Split(new char[3]{' ', ',', '.'}); string[] nameTokens2 = fullName2.Split(new char[3]{' ', ',', '.'}); string[] nameTokens3 = fullName3.Split(new char[3]{' ', ',', '.'}); foreach (string tok in nameTokens1) {    Console.WriteLine(tok); } Console.WriteLine(""); foreach (string tok in nameTokens2) {    Console.WriteLine(tok); } Console.WriteLine(""); foreach (string tok in nameTokens3) {    Console.WriteLine(tok); } 

This code produces the following output:

 John Doe Doe John John Q Doe 

Notice that a blank is inserted between the ' . ' and the space delimiters of the fullName3 name ; this is correct behavior. If you did not want to process this space in your code, you can choose to ignore it.

Discussion

If you have a consistent string whose parts , or tokens , are separated by well-defined characters , the Split function can tokenize the string. Tokenizing a string consists of breaking the string down into well-defined, discrete parts, each of which is considered a token. In the two previous examples, the tokens were either parts of a mathematical equation ( numbers and operators) or parts of a name (first, middle, and last).

There are several drawbacks to this approach. First, if the string of tokens is not separated by any well-defined character(s), it will be impossible to use the Split method to break up the string. For example, if the equation string looked like this:

 string equation = "1+2-4*5"; 

we would clearly have to use a more robust method of tokenizing this string (see Recipe 8.7 for a more robust tokenizer).

A second drawback is that a string of tokenized words must be entered consistently in order to gain meaning from the tokens. For example, if we ask users to type in their names, they may enter any of the following:

 John Doe Doe John John Q Doe 

If one user enters in his name the first way and another user enters it the second way, our code will have a difficult time determining whether the first token in the string array represents the first or last name. The same problem will exist for all of the other tokens in the array. However, if all users enter their names in a consistent style, such as First Name , space, Last Name , we will have a much easier time tokenizing the name and understanding what each token represents.

See Also

See Recipe 8.7; see the "String.Split Method" topic in the MSDN documentation.



C# Cookbook
C# 3.0 Cookbook
ISBN: 059651610X
EAN: 2147483647
Year: 2003
Pages: 315

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net