Recipe 2.6. Implementing a Poor Man's Tokenizer to Deconstruct a String

Problem

You need a quick method of breaking up a string into a series of discrete tokens or words.

Solution

Use the Split instance method of the String class. For example:

 string equation = "1 + 2 - 4 * 5";
 string[] equationTokens = equation.Split(new char[1]{' '});

 foreach (string Tok in equationTokens)
     Console.WriteLine(Tok);

This code produces the following output:

 1
 +
 2
 -
 4
 *
 5

The Split method may also be used to separate people's first, middle, and last names. For example:

 string fullName1 = "John Doe";
 string fullName2 = "Doe, John";
 string fullName3 = "John Q. Doe";

 string[] nameTokens1 = fullName1.Split(new char[3]{' ', ',', '.'});
 string[] nameTokens2 = fullName2.Split(new char[3]{' ', ',', '.'});
 string[] nameTokens3 = fullName3.Split(new char[3]{' ', ',', '.'});

 foreach (string tok in nameTokens1)
 {
     Console.WriteLine(tok);
 }
 Console.WriteLine("");

 foreach (string tok in nameTokens2)
 {
     Console.WriteLine(tok);
 }
 Console.WriteLine("");

 foreach (string tok in nameTokens3)
 {
     Console.WriteLine(tok);
 }

This code produces the following output:

 John
 Doe

 Doe

 John

 John
 Q

 Doe

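Notice that Split produces an empty string between the period and the space delimiters of fullName3, which prints as a blank line. This is correct behavior: two adjacent delimiters enclose a zero-length token. The same thing happens between the comma and the space in fullName2. If you do not want to process these empty tokens, your code can simply skip them.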
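If you would rather not receive the empty tokens at all, one option is the Split overload that takes a StringSplitOptions argument. The following is a minimal sketch, assuming .NET Framework 2.0 or later, where that overload is available; the NameTokens class and SplitName method are illustrative names, not part of the original recipe:

```csharp
using System;

public static class NameTokens
{
    // Splits on space, comma, and period, and drops the zero-length
    // tokens that appear between adjacent delimiters such as ". " or ", ".
    public static string[] SplitName(string fullName)
    {
        char[] delimiters = { ' ', ',', '.' };
        return fullName.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // Prints John, Q, and Doe on separate lines, with no blank line.
        foreach (string tok in SplitName("John Q. Doe"))
        {
            Console.WriteLine(tok);
        }
    }
}
```

On earlier framework versions, which lack this overload, checking tok.Length inside the loop gives the same result.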

Discussion

If you have a consistent string with parts, or tokens, that are separated by well-defined characters, the Split method can tokenize the string. Tokenizing a string consists of breaking the string down into well-defined, discrete parts, each of which is considered a token. In the two previous examples, the tokens were either parts of a mathematical equation (numbers and operators) or parts of a name (first, middle, and last).

There are several drawbacks to this approach. First, if the string of tokens is not separated by any well-defined character(s), it will be impossible to use the Split method to break up the string. For example, if the equation string looked like this:

 string equation = "1+2-4*5"; 

you would clearly have to use a more robust method of tokenizing this string (see Recipe 10.7 for a more robust tokenizer).
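As a rough illustration of what a delimiter-free approach might look like, the sketch below uses a regular expression to pull out runs of digits and single operator characters. It is not the tokenizer from Recipe 10.7; the ExpressionTokens class, the Tokenize method, and the pattern itself are assumptions made for this example, and it handles only unsigned integers and the four operators shown:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ExpressionTokens
{
    // Matches either a run of digits or a single operator character.
    // Anything else (spaces, letters, parentheses) is silently skipped.
    private static readonly Regex TokenPattern = new Regex(@"\d+|[-+*/]");

    public static List<string> Tokenize(string expression)
    {
        List<string> tokens = new List<string>();
        foreach (Match m in TokenPattern.Matches(expression))
        {
            tokens.Add(m.Value);
        }
        return tokens;
    }

    public static void Main()
    {
        // Prints 1, +, 2, -, 4, *, 5 on separate lines.
        foreach (string tok in Tokenize("1+2-4*5"))
        {
            Console.WriteLine(tok);
        }
    }
}
```

Because the pattern describes the tokens rather than the separators, the same call also works on "1 + 2 - 4 * 5". The trade-off is that unrecognized characters are dropped without warning, which a more robust tokenizer would report as errors.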

A second drawback is that a string of tokenized words must be entered consistently in order to gain meaning from the tokens. For example, if you ask users to type in their names, they may enter any of the following:

 John Doe
 Doe John
 John Q Doe

If one user enters his name the first way and another user enters it the second way, your code will have a difficult time determining whether the first token in the string array represents the first or last name. The same problem will exist for all of the other tokens in the array. However, if all users enter their names in a consistent style, such as First Name, space, Last Name, you will have a much easier time tokenizing the name and understanding what each token represents.
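To make the ambiguity concrete, here is a small sketch that guesses the name order from the presence of a comma. The NameParser class and ParseName method are hypothetical, and the rule they apply (a comma means "Last, First"; anything else means "First [Middle] Last") is an assumption made for illustration, not a reliable way to interpret names. It also assumes .NET Framework 2.0 or later for StringSplitOptions:

```csharp
using System;

public static class NameParser
{
    // Returns a two-element array: { first name, last name }.
    // Assumes a comma anywhere in the input means "Last, First";
    // otherwise the first token is the first name and the final
    // token is the last name.
    public static string[] ParseName(string fullName)
    {
        char[] delimiters = { ' ', ',', '.' };
        string[] tokens = fullName.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
        if (tokens.Length < 2)
        {
            throw new ArgumentException("Expected at least two name parts.", "fullName");
        }

        bool lastNameFirst = fullName.IndexOf(',') >= 0;
        string first = lastNameFirst ? tokens[1] : tokens[0];
        string last = lastNameFirst ? tokens[0] : tokens[tokens.Length - 1];
        return new string[] { first, last };
    }

    public static void Main()
    {
        string[] name = ParseName("Doe, John");
        Console.WriteLine(name[0] + " " + name[1]);
    }
}
```

This handles "John Doe", "Doe, John", and "John Q. Doe", but an entry such as "Doe John" with no comma is still read as first name Doe, last name John. No amount of splitting can recover information the input does not contain, which is why a consistent entry format matters.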

See Also

See Recipe 10.7; see the "String.Split Method" topic in the MSDN documentation.



C# Cookbook
