The string Type

   


The string type

Single characters are not much fun on their own but, when put together, they constitute a powerful communication tool. The string type does exactly this for us, providing a powerful means of representing and working with strings of characters. strings are consequently used in most of your programs for names, descriptions, addresses, headings, and so on.

The word string is an alias for the .NET Framework class String located in the System namespace. Whenever you declare a variable to be of type string, you are, in fact, specifying this variable to be able to hold a reference to a System.String object, which then encapsulates your text. Consequently, the following two lines of source code have exactly the same meaning.


graphics/07infig31.gif

The string class comes packed with an entire host of features all based in the .NET Framework class library. You will see many examples of these features in this section.

string Literals and string Objects

The specification of a string literal, such as

 "This is a string" 

is familiar. The two double quotes enable the compiler to treat the individual characters between them as one single item. A string literal is an object of class string.

Recall how new objects of classes usually must be created with the use of the new keyword, such as the following:


graphics/07infig32.gif

However, you can assign a string literal to a variable of type string simply by writing omitting the use of new. This is a convenient feature. Due to string types fundamental place in C# programming and to improve its ease of use, it has been given special abilities, such as the one mentioned here.


graphics/07infig33.gif

string is a reference type. Consequently, the latter assignment results in someText referring to a string object holding the text "This is a string" as follows:


graphics/07infig34.gif

It is possible to let someText refer to another string object. For example, we can write

 someText = "I'm the new string in the block"; 

which will move the reference from the string object holding the "This is a string" to the new object as follows:


graphics/07infig35.gif

Verbatim Strings

From the previous char type section, recall how the problem of representing the backslash character '\' in a string could be solved. C# offers an elegant alternative to this procedure: By specifying a string literal to be interpreted literally (what you see is what you get), you can prevent the compiler from viewing the backslash \character as a "begin-escape-sequence" symbol. A factually interpreted string is called a verbatim string literal and is indicated by putting an @ in front of the string literal. For example, to represent the previous path ("D:\MyFiles\temp\BliposClock.cs"), you can use the following instead of using the \\escape sequence as used in the char section:


graphics/07infig36.gif

This produces exactly the same output.

Working with Strings

The string (System.String) class contains a vast array of useful methods that enable you to, for example,

  • Concatenate two or more strings.

  • Compare two strings.

  • Access individual characters or sub-strings of a particular string.

  • Insert part of one string into another string.

  • Copy a string.

  • Find the number of characters in a string.

To utilize these methods, we need a system by which we can identify individual characters and sub-strings in a string. For example, we might need to specify that we want to copy the sub-string "and" from the following string:

 ".NET and C#." 

The string class uses a sequentially numbered index starting with 0 (not 1), as illustrated in Figure 7.7, to accomplish this. Notice how the periods and blanks are counted as part of the index and that the period in ".NET" is located at position 0.

Figure 7.7. The underlying index of the string class.
graphics/07fig07.gif

Thus, to reach the sub-string "and", we could write the equivalent of "get 3 characters starting at position 5." In C#, this is written as line 3 in the following code snippet:

 01:       string andString; 02:       string myString = ".NET and C#"; 03:       andString = myString.SubString(5, 3); 

Line 3 utilizes the SubString method (see Table 7.6) of the string class to copy "and" from myString to andString. In particular, myString.SubString (5, 3) instructs the program to return a sub-string from myString, beginning at index 5 and including 3 characters.

Table 7.6 lists some of the string class methods and properties along with explanations and examples. The remaining methods can be found in the .NET Framework Documentation.

Table 7.6. Selected Methods and Properties of the string Class
Method/Property Explanation Example
salutation.IndexOf (<Any_String>) Returns the position of the first occurrence of the sub-string <Any_String> in salutation. 1 is returned if <Any_String> is not found.

salutation. IndexOf("Morn") returns 5

salutation. IndexOf("Hello") returns 1

salutation.Length Returns the number of characters in the string. salutation.Length returns 13
salutation[<index>] Where <index> is a positive integer Returns the character at the <index> of string. salutation[3] returns 'd'
salutation.Equals (<Any_String>)

If salutation and <Any_String> represents identical strings, this method returns true; otherwise, false.

Note: the comparison operator == provides same comparison. Thus, salutation.Equals (<Any_String>) is equivalent to salutation == <Any_String>.

salutation. Equals("Good Morning!") Returns true salutation. Equals("Good Afternoon!") Returns false. salutation == "Good Morning!" Returns true.
salutation.ToUpper() Returns a string with the same sequence of characters as salutation, but all letters are converted to uppercase. salutation. ToUpper()Returns "GOOD MORNING!"
salutation.ToLower() Returns a string with the same sequence of characters as salutation, but all letters are converted to lowercase. salutation. ToLower() Returns "good morning!"
salutation. LastIndexOf (<Any_String>) Returns the position of the last occurence of the sub-string <Any_String> in salutation. 1 is returned if <Any_String> is not found. salutation. LastIndexOf("o") Returns 6.
salutation.CompareTo (<Any_String>) Compares salutation with <Any_String> by utilizing the corresponding Unicode value of each character. If salutation is less than <Any_String>, it returns a negative value. If salutation. is greater than <Any_String>, it returns a positive value. If the two strings are equal, it returns zero. Only if both strings consist of either upper- or lowercase letters will the comparison constitute a valid lexicographic (alphabetic) evaluation. salutation. CompareTo ("Afternoon") Returns a positive value salutation CompareTo ("Night") Returns a negative value salutation. CompareTo("Good Morning!") Returns zero. salutation. CompareTo("good Morning") Returns a positive value salutation. CompareTo("GOOD MORNING!" Returns a negative value.
salutation. Substring (<Index_Start>) Returns the sub-string of salutation starting from <Index_Start> continuing to the end of salutation. salutation. Substring(6) Returns "orning!"
salutation. Substring (<Index_Start>, <Length>) Returns the sub-string of salutation starting from <Index_Start> and including <Length> number of characters. salutation. Substring(6,3) Returns "orn"
salutation.Insert (<Index_Insert> , <Any_String>) Returns a string with <Any_String> inserted at position <Index_Insert> of salutation. salutation.Insert (5, "Better Best ") Returns "Good Better Best Morning!"
string.Copy (salutation) Returns a new copy of salutation. String.Copy (salutation) Returns "Good Morning!"
Format_Specifier ::= {<Argument_Index> [,<Width>] [:<FormatString> } 

Notes:

<Argument_Index> is the zero-based index of the argument to be formatted.

<Width> is the width (in number of space characters) of the section containing the parameter. A negative number indicates the parameter to be left-aligned, a positive to be right-aligned. <FormatString> is a string of special formatting codes with conventions identical to that of the ToString methods of the numeric types.

Allows you to construct strings containing normal static text with special markers, called Format_Specifiers, embedded at suitable positions. Each format specifier indicates where an argument (specified in the Console.WriteLine() method call) is to be inserted, along with an optional width and format.

Console.WriteLine ("N1: N2: { 0} { 1} N3: { 2} ", 15, 28, 39);Prints: N1: 15 N2: 28 N3: 39

Console.WriteLine ("Amount: { 0:C} ", 50000000.456); Prints: Amount: $50,000,000.46

Console.WriteLine ("Distance1: { 0:E2} Distance2: { 1:E4} ", 75463728991, 901100000000); Prints: Distance1: 7.55E+010 Distance2: 9.0110E+011

Console.WriteLine ("Mass: { 0,10} grams. Energy: { 1,10} joules",100, 300); Prints: Mass: 100 grams Energy: 300 joules

Console.WriteLine ("Mass: { 0,-10} grams Energy: { 1,-10} joules", 100, 300); Prints: Mass: 100 Energy: 300 joules

Note: salutation is a string declared as

string salutation = "Good Morning!";

OOP, Encapsulation, and System.String

graphics/common.gif

The System.String (string) class relies heavily on object-oriented principles in general, and encapsulation in particular, to provide power and ease of use to the programmer, and protects its state (the instance variables) from untimely access.

System.String is a complex class with hidden (private) methods and instance variables. Only useful methods, such as those presented in Table 7.6 are declared public by the designers of System.String, and so are made accessible.


Embedding Formatted Numbers in a string

This section should be read in conjunction with the last row of Table 7.6.

Recall how the ToString method of each numeric simple type allowed you to improve the readability and compactness of a number when converting it to a string. For example, you could write

 Console.WriteLine("Distance traveled: " + 10000000.432.ToString("N2")); 

This produces the following output:


graphics/07infig37.gif

The ToString method clearly provides you with elaborate ways of formatting one single number. However, as we realized by the end of the ToString method discussion in Chapter 6, it results in a somewhat longwinded and unclear way of combining static text with formatted numbers. Fortunately, as you saw, format specifiers come to the rescue. The path to an understanding of the format specifier begins with the well-known Console.WriteLine()method.

A Glimpse of Overloaded Methods

graphics/common.gif

Consider the following valid calls to a method Sum found close to each other in the same source program:


graphics/07infig38.gif

Notice that they have the same name but different numbers of arguments. The previous brief discussions of methods told you that the number of arguments must match the number of formal parameters specified. So why are the previous calls valid? It is possible to define different versions of the same method name, but with differing numbers and types of formal parameters. A method name of the same class and different versions is called an overloaded method. By looking at the number of arguments and their types in the method call, C# will automatically apply the matching version of the method. This is illustrated in the following with the method header and definition on the left and the method call on the right.


graphics/07infig39.gif

Letting multiple methods share the same name is a powerful feature and is used frequently in the .NET Framework class library. Chapter 12, "Class Anatomy Part I: Static Class Members And Method Adventures" will show you how to implement overloading in your own programs in more detail.


Console.WriteLine is an overloaded method representing many different versions. One version takes the following arguments:

 Console.WriteLine(<Format>, <Value0>, <Value1>...) 

where

  • <Format> represents static text with markers called format specifiers (see Table 7.6).

  • <Value0>, <Value1>... represents values that are to be inserted in <Format> at positions indicated by the format specifiers.

A format specifier is indicated by curly brackets ({}). It must always contain an index that references one of the subsequent <Value>s and specifies the value to be inserted at its location of the string. The following call to the Console.WriteLine method


graphics/07infig40.gif

prints the following to the console:

 Mass: 100  Distance: 50  Energy:  35 

As an option, the format specifier also lets you specify the width of the region where a <Value> is to be printed. It is indicated by putting a comma after the index followed by a number specifying the width in characters. A positive number will right-align the <Value>, a negative number will left-align the <Value>. For example,

 Console.WriteLine("Distance: { 0,10}  miles", 100); 

prints


graphics/07infig41.gif

whereas, the following


graphics/07infig42.gif

prints


graphics/07infig43.gif

Finally, you can determine the format of the value by including an optional format character in the format specifier that can be followed by the optional precision specifier. The format character must follow a semicolon. It has exactly the same meaning as when used with the ToString method. For example, you can write

 Console.WriteLine("Distance: { 0,-12:E2}   Mass: { 1,-12:N} ", 100000000, 5000000); 

which generates the following output:

 Distance: 1.00E+008    Mass: 5,000,000.00 

Tip

graphics/bulb.gif

To print strings with curly brackets on the console, simply omit the <Value0>, <Value1> after the static text of the Console.WriteLine call. The C# compiler will choose another version (due to overloading) of Console.WriteLine that ignores {} brackets and prints them as is. For example,

 Console.WriteLine("Number { 0}  is black, number { 1}  is red"); 

prints the following

 Number { 0}  is black, number { 1}  is red 

on the console.


The Immutability of strings

graphics/common.gif

When a string object has been created, the set of characters it represents cannot be modified. For that reason, the string is said to be immutable. Consequently, none of the string methods will modify an existing string; instead, they create and return a new modified string. For example, if you want to change the characters of a string to uppercase by using the string method ToUpper, you cannot utilize the following procedure:


graphics/07infig44.gif

There are two problems with this latter statement. First, myText.ToUpper() returns a string, but none is being assigned or otherwise utilized. Additionally, ToUpper() does not change the string of myText to uppercase characters but, instead, returns a string containing the uppercase characters corresponding to the string of myText.

A successful conversion of myText to uppercase can be done as follows:


graphics/07infig45.gif

To understand why this is correct, recall why count = count + 1; works, as discussed earlier in this chapter. The right side of the assignment operator is processed first and, when finished, the result is assigned to myText. This same procedure can be utilized for many of the string methods.


Constructing strings with System.Text.StringBuilder

graphics/bulb.gif

The immutability of the string class can sometimes lead to inconvenient and inefficient code when you are constructing longer strings from many small fragments of text because you will have to write lines like the following (where myText and newText are of type string):

 myText = myText + newText; 

The StringBuilder class located in the System.Text namespace offers an elegant and faster alternative to the string class, during string construction, because the text it represents can be updated. This is illustrated in Listing 7.7. Line 2 allows you to write shortcuts to classes in the System.Text namespace. Line 9 creates a new StringBuilder instance containing the text "Once upon a time" and assigns its reference to fairyTaleBeingWritten. Lines 10 and 11 append new text to this instance. Once constructed, fairyTaleBeingWritten's text is transferred to the string finalFairyTale in line 12.

StringBuilder contains other useful methods. You can explore them further in the .NET Framework documentation.


Listing 7.7 Source code of StringMaker.cs
01: using System;  02: using System.Text; 03: 04: class StringMaker 05: { 06:     public static void Main() 07:     { 08:         string finalFairyTale; 09:         StringBuilder fairyTaleBeingWritten = new StringBuilder("Once upon a time"); 10:         fairyTaleBeingWritten.Append(" a beautiful princess"); 11:         fairyTaleBeingWritten.Append(" lived in a big castle"); 12:         finalFairyTale = fairyTaleBeingWritten.ToString(); 13:         Console.WriteLine(finalFairyTale); 14:     } 15: }  Once upon a time a beautiful princess lived in a big castle 

Working with strings

At the time of writing, the string class contains approximately fifty methods, a selected few of which are presented in Table 7.6. All methods are explained in the .NET Framework Reference and for this reason, I will not attempt an exhaustive presentation of each and every method here. To demonstrate the usability of the string methods, I have provided a few elaborate text-analyzing source code examples. The examples can be viewed as simple precursors to important components, such as spell checkers, thesauruses, word and character counters, and so on, found in all professional word processors like MS-Word. However, the underlying logic behind the presented algorithms are applied in many other contexts.

Accessing and Analyzing Individual Characters of a string

The source code in Listing 7.8 performs an analysis of a given text entered by the user. Specifically, it prints the following statistics on the console.

  1. The number of vowels

  2. The number of consonants

  3. The number of letters

  4. The number of digits

  5. The number of words

  6. The average word length

Designing a Text-Analyzing Algorithm

The text provided by the user is stored in a variable of type string.

To determine the previous first four points, you must access and evaluate each individual character of the given text.

Before scrutinizing each character, you need to declare three counters: a vowel counter, a letter counter, and a digit counter. Each counter must be initialized to zero. You are now ready for the following loop:

For each individual character do the following:

If it is a vowel, increment the vowel counter by 1.

If it is a letter, increment the letter counter by 1.

If it is a digit, increment the digit counter by 1.

Because a letter can be either a vowel or a consonant, the number of consonants can be calculated with the expression

number of letters number of vowels.

To calculate point 5 (number of words), assume that the number of words is equal to the number of spaces plus one.

For example, the next line contains seven spaces and eight words.

This line contains seven spaces and eight words.

So, while we are checking each individual character to assess points 1 4, it is also useful to count the number of whitespace characters. At the end, when all characters have been accessed, you simply add one to this number to determine the number of words.

Finally, the average word length can be found by calculating

 (("number of letters" + "number of digits") / "number of words") 

Note that a number is also counted as a word.

The Final C# Program: TextAnalyzer.cs

You are now ready to write the C# program. It is displayed in Listing 7.8.

Note

graphics/common.gif

For the program to work correctly, you must only include one space character between each word of the text fed into the program.


Listing 7.8 Code of TextAnalyzer.cs
01: using System; 02: /* 03:  * This program will provide the following 04:  * information about a text: 05:  * The number of vowels 06:  * The number of consonants 07:  * The number of digits 08:  * The number of words 09:  * The average word length 10:  */ 11: class TextAnalyzer 12: { 13:     public static void Main() 14:     { 15:         string myText; 16:         int numVowels = 0; 17:         int numLetters = 0; 18:         int numDigits = 0; 19:         int numWhitespaceChars = 0; 20:         int numWords = 0; 21:         char ch; 22:         int index = 0; 23:  24:         Console.WriteLine("Please enter text to be analyzed:"); 25:         myText = Console.ReadLine(); 26:         myText = myText.ToUpper(); 27: 28:         while (index < myText.Length) 29:         { 30:             ch = myText[index]; 31:              //All letters have been converted to uppercase 32:              //so only need to check for uppercase vowels. 33:             if(ch == 'A' || ch == 'E' || ch == 'I' || 34:                  ch == 'O' || ch == 'U' || ch == 'Y') 35:                 numVowels++; 36:             if(char.IsLetter(ch)) 37:                 numLetters++; 38:             if(char.IsDigit(ch)) 39:                 numDigits++; 40:             if(char.IsWhiteSpace(ch)) 41:                 numWhitespaceChars++; 42:             index++; 43:         } 44: 45:         numWords = numWhitespaceChars + 1; 46:  47:         Console.WriteLine("Text analysis:"); 48:         Console.WriteLine("Number of vowels: { 0:N0} ", numVowels); 49:         Console.WriteLine("Number of consonants: { 0:N0} ", 50:             (numLetters - numVowels)); 51:         Console.WriteLine("Number of letters: { 0:N0} ", numLetters); 52:         Console.WriteLine("Number or digits: { 0:N0} ", numDigits); 53:         Console.WriteLine("Number of words: { 0:N0} ", numWords); 54:         Console.WriteLine("Average word length: { 0:N2} ", 55:             ((numLetters + numDigits) / (float)numWords)); 56:     } 57: } Please enter text to be analyzed: Now go, write it before them in a table, and note it in a book. Isaiah 30:8<enter> Text analysis: Number of vowels: 25 Number of consonants: 27 Number of letters: 52 Number of digits: 3 Number of words: 17 Average word length: 3.24 

Lines 16 19 contain counters of the vowels, letters, digits, and white space characters; they are all initialized to 0.

The index variable declared in line 22 is used to specify the individual character accessed.

Because the string is mutable it is not possible to modify the string of characters inside a particular string. However, it is possible to let it return a modified string. In this case, the original string of myText returns a set of corresponding uppercase characters that are assigned back to myText in line 25. The end result is that all letters of myText are converted to uppercase letters.

Lines 28, 29, 42 and 43 represent the important ingredients for the while loop starting in line 28. The while loop is repeated exactly myText.Length times for the following reasons:

  • index has a starting value of 0 when the program execution reaches line 28 for the first time.

  • index is incremented by 1 in line 42 every time the loop is repeated.

  • According to line 28, the while loop is repeated as long as index is less than myText.Length.

The while loop block begins at line 30 and ends at line 42.

The character with position index is accessed and assigned to the ch variable in line 30. Keeping the while loop just described in mind, the program is accessing every single character one after the other.

Lines 33-41 examine the ch variable.

The parallel vertical lines || in lines 33 and 34 symbolize the logic operator "OR". || is discussed further in Chapter 8, "Flow Of Control Part I: Branching Statements And Related Concepts." For now, you can regard lines 33 and 34 as saying "If ch holds one of the 6 vowels ('A', 'E', 'I', 'O', 'U', 'Y'), increment numVowels by one in line 35."

Lines 36-37 use the static method of char called IsLetter to determine whether ch is a character. If this is, the case numLetters is incremented by one.

Lines 38 and 39 use another of char's static methods, IsDigit, to determine whether ch is a digit. If this is the case, increment numDigits by one.

Lines 40 and 41 use the static method of char called IsWhiteSpace to determine whether ch is a space; in which case, numWhiteSpaceChars is incremented by one.

Notice that none of the punctuation characters (, ., ;, :, and so on) are counted by any of the mentioned counters.

When execution has reached line 45, numWhitspaceChars is equal to the total number of spaces in the text. But, because the number of words is equal to the number of spaces + 1, we must add 1 to numWhitespaceChars to calculate numWords.

Lines 47 55 print the results of the various counts to the console.

Sorting strings in Alphabetical Order

Electronic address books, telephone books, encyclopedias, and dictionaries all represent items relying on the ability to sort a list of words in alphabetical order.

This section presents a simple program for sorting just three words in alphabetical order. Its underlying algorithm is a precursor to the famous sorting algorithm called Bubble Sort, which is presented in Chapter 11, "Arrays Part II: Multidimensional Arrays. Searching And Sorting Arrays". After you understand the underlying logic for these algorithms, you will be able to sort not only words, but also any set of elements that are comparable, such as numbers, dates, and so on.

At the heart of most conventional word sorting programs lays the ability to compare just two words and detect which of these two words comes first in the alphabet. This ability allows us to sort any length of strings by using more-or-less sophisticated sorting algorithms.

The string Method CompareTo()

The built-in string method CompareTo (see Table 7.6) provides us with an ability to compare the Unicode values of the characters constituting two strings. Thus, if we restrict our strings to contain either uppercase letters only or lowercase letters only (see the following Note), we can perform a valid lexicographic (using alphabetic sequence) comparison.

Note

graphics/common.gif

A comparison between two characters in the CompareTo method is made by comparing the Unicode values of the two characters. The list of Unicode values found in Appendix E reveals that any uppercase letter has a smaller Unicode value than any lowercase letter. However, within the uppercase and lowercase letters respectively, the Unicode values are sequenced in alphabetical order. Thus, only when two strings consist entirely of uppercase or lowercase letters can a reliable lexicographic comparison be performed.

For example,

"W" is less than "w"

"Z" is less than "a"

"stormy Weather" is less than "stormy weather"


Note

graphics/common.gif

Comparing for equality as in string1 == string2, does not help us to sort strings. We do not get the vitally important information of which string has the greatest value.


Before looking at Listing 7.10 with the final source code, we need to address a couple of issues.

A Closer Look at the CompareTo Method

The algorithm used by CompareTo when comparing two strings compares successive characters with identical index in each string, starting with the first character. The algorithm will continue this procedure until one of the following two situations occur:

  • Two corresponding characters are detected to be unequal In this case, the string containing the character (of the two corresponding characters) with the smallest Unicode value will be the smallest string.

    For example, "many" is greater than "mane" because y is greater than e.

  • The end of the shortest string is reached The shorter string will then be the smallest string.

    For example, "AAA" is greater than "AA".

Swapping the Contents of Two Variables

Swapping the values of two variables is a frequently used piece of functionality in a source program. This is utilized in Listing 7.10 as part of its sorting algorithm.

To explain the swapping process used here, consider that we want to swap the contents of the variables A and B.

A first hunch to swap the values would be to assign for example B to A. However, A is then equal to B and it is impossible to assign the original value of A to B. Thus, before we perform this first assignment, we must preserve the original value of A. To do this, we employ a temporary variable, called temp, which is assigned the value of A. Thus, we can write the following:

 temp = A; A = B; 

Now A holds the value of B, so we are half way. We now need to move the original value of A over to B. Fortunately, temp is holding the original A value, so the full swap can be accomplished by writing

 B = temp; 
Designing a Simple Sorting Algorithm

Sorting algorithms are fundamental to computer programming. Large portions of the output from today's applications are sorted one way or another. Internally, an initial sorting of data can often speed up subsequent processes. Consequently, sorting is an essential and intensely studied process that, over the years, has produced numerous ingenious, weird, wonderful, and extremely fast sorting algorithms.

Note

graphics/common.gif

The database is just one of many types of applications that rely heavily on sorting algorithms. For example, invoices are sorted in order of their invoicing date or their total amount, and customers are sorted in terms of how much they buy or by their names.


Listing 7.9 is a primitive sorting algorithm, but it is relatively simple and gives you an idea about what sorting is all about. The sorting algorithm utilized is written with pseudocode. It sorts the three values held by num1, num2, and num3 in ascending order. In this example, they initially hold the values 30, 20, and 10 (descending order), respectively. After having been processed by the sorting algorithm, their values are 10, 20 and 30 (ascending order), respectively. Consequently, after the sorting has been applied, numA will have the smallest value, numB will have the middle value, and numC will have the greatest value, regardless of the starting values.

Listing 7.9 Pseudocode for Sorting Algorithm used in Listing 7.10
01: set numA equal to 30 02: set numB equal to 20 03: set numC equal to 10 04: 05: repeat as long as: num2 and num3 were swapped in last loop (Note: first loop is  graphics/ccc.gifalways performed) 06: { 07:      if numA is greater than numB then 08:             swap numA and numB 09:      if numB is greater than numC then 10:             swap numB and numC 11: } 

Why does it work? Tracing is a good way to get an initial understanding for this algorithm, so this will be the first part of our analysis. Recall our first encounter with the alien from planet Blipos in Chapter 2. When the alien checked our average calculating algorithm, he was effectually tracing it. Tracing simply means to execute the algorithm as if you were the computer. You need to keep track of the variable values and any changes made to these as you "execute" the algorithm. Let's get started tracing Listing 7.9.

After lines 1-3, the values of the variables are

numA = 30

numB = 20

numC = 10

Line 5 indicates we have to perform the loop at least once, so let's begin.

Because numA(30) is greater than numB(20), we swap these two variables in lines 7 and 8.

The variables now hold the following values:

numA = 20

numB = 30

numC = 10

Because numB(30) is greater than numC(10), we swap these two variables in lines 9 and 10.

The variables now hold the values

numA = 20

numB = 10

numC = 30

We return to line 5. Do we need to repeat the loop? Yes, because numB and numC were swapped in the previous loop.

Because numA(20) is greater than numB(10), we swap these two variables in lines 7 and 8.

 numA(10) numB(20) numC(30) 

In lines 9 and 10, because numB is not greater than numC, we do not swap the two variables.

We return to line 5. Because numB and numC were not swapped in the last loop, we stop the algorithm and the tracing. The three numbers have finally been sorted as requested.

So, even though numA and numB are correctly sorted after lines 7 and 8, lines 9 and 10 might spoil this by swapping a value into numB from numC that is smaller than numA. Consequently, we must repeat the loop until numB and numC are not swapped. At the same time, every swap is beneficial and able to transport number 10 initially in numC up to numA and 30 down to numC. In fact, the algorithm seems to work like a pump with the ability to "pump" small numbers upward until they are blocked by even smaller numbers. This is illustrated in Figure 7.8.

Figure 7.8. A "pumping" algorithm.
graphics/07fig08.gif
The Final C# String Sorter Program

It's time to have a look at the final source code in Listing 7.10.

Note

graphics/common.gif

Notice that you must either input words with only upper- or lowercase letters to achieve a reliable alphabetical sort. This issue is discussed earlier in this section.


Listing 7.10 Code of StringSorter.cs
01: using System; 02: /* 03:  * The Main method of the StringSorter class 04:  * Reads in 3 user specified strings 05:  * and sorts them in alphabetic order. 06:  */ 07: class StringSorter 08: { 09:     public static void Main() 10:     { 11:         string string1; 12:         string string2; 13:         string string3; 14:         string tempString; 15:         bool changes = true; 16:  17:         Console.Write("Enter first string: "); 18:         string1 = Console.ReadLine(); 19:         Console.Write("Enter second string: "); 20:         string2 = Console.ReadLine(); 21:         Console.Write("Enter third string: "); 22:         string3 = Console.ReadLine(); 23:         while(changes) 24:         { 25:             changes = false; 26:             if(string1.CompareTo(string2) > 0) 27:             { 28:                  //The next 3 lines 29:                  //swap string1 and string2 30:                 tempString = string1; 31:                 string1 = string2; 32:                 string2 = tempString; 33:             } 34:             if(string2.CompareTo(string3) > 0) 35:             { 36:                  //The next 3 lines 37:                  //swap string2 and string3 38:                 tempString = string2; 39:                 string2 = string3; 40:                 string3 = tempString; 41:                 changes = true; 42:             } 43:         } 44:         Console.WriteLine("The strings in alphabetic order:"); 45:         Console.WriteLine(string1); 46:         Console.WriteLine(string2); 47:         Console.WriteLine(string3); 48:     } 49: } Enter first string: monkey<enter> Enter second string: elephant<enter> Enter third string: bird<enter> The strings in alphabetic order: bird elephant monkey 

Line 14 declares the tempString variable of type string utilized for swapping two string variables.

Line 15 declares a variable of type bool with the name changes. It is used to keep track of whether string2 and string3 were swapped in the previous loop (more details will follow). A variable, such as changes, that "remembers" whether a certain event has taken place is called a flag.

Note

graphics/common.gif

You have previously seen how a single statement can be positioned immediately after an if keyword and its associated condition as follows:

 01: if (myNumber > 10) 02:     Console.WriteLine("myNumber is greater than 10"); 03: Console.WriteLine("I'm always printed irrespective of myNumber's size"); 

But what if we want more than one statement to be executed only when the if condition myNumber > 10 is true? C# allows you to substitute a single statement with a block of statements enclosed by curly braces ({} ). All statements inside this block will be executed when the if condition is true. None of them will be executed if the condition is false:

 01: if (myNumber > 10) 02: { 03:     Console.WriteLine("myNumber is greater than 10"); 04:     Console.WriteLine("I'm also only printed if myNumber > 10 is true"); 05: } Console.WriteLine("I'm always printed"); 


The braces in lines 27 ({ ) and 33 (} ) enclose the block of statements to be executed if the condition (string1.CompareTo(string2) > 0) of the if statement is true. Consequently, if this condition is true, all the statements in lines 30 32 will be executed. On the other hand, if the condition is false, the execution will jump over this block and continue at line 34. The same logic applies for the block enclosed by the braces of lines 35 and 42.

As long as changes in line 23 is holding the value true, the while loop will be repeated. To make sure that changes is really true due to a swap of string2 with string3 in the previous loop, changes is set to false in the beginning of every loop and then later changed to true in line 41 if the swap really took place. Without line 25, changes would always stay true and the loop would repeat itself forever; we would end up with a so-called infinite loop.

In lines 26 33, if the condition (string1.CompareTo(string2) > 0) is true, it means string1 is larger than string2. Because we need to "pump" the smaller values "up," we need to swap the two values. This is performed in lines 30 32 with the help of the tempString as discussed earlier.

Lines 34 42 represent the same logic as lines 26 33 with the small addition that changes will be set to true in case of a swap.

Note

graphics/common.gif

Many algorithms and programs can be divided into three separate phases:

  1. A phase that declares and initializes the variables and constants of the program (lines 11 15 in this example).

  2. A processing phase where the user inputs information and information is acquired from various sources, such as files, networks, the Internet, and so on. The main part of the processing takes place in this phase (lines 17 43).

  3. A completion phase where final calculations are performed and the requested results provided as output (lines 44 47).


Extracting Sub-strings from a string

Instead of analyzing individual characters, as the TextAnalyzer.cs source code of Listing 7.10 does, the WordExtractor.cs source code presented in Listing 7.12 extracts individual words from a string and prints words with a specified minimum length. WordExtractor.cs is heavily dependent on the functionality of the string method SubString.

Note

graphics/common.gif

The ability to isolate and recognize individual words in a sequence of characters is used in many different types of applications:

  • Any sophisticated word-processing features, such as spell check and thesaurus facilities, rely on the ability to recognize individual words.

  • Computer language compilers (which are programs themselves) must, as one of the first steps during their compilation process, isolate the tokens existing in the source code to recognize keywords, operators, and other language elements.

  • Browsers understand HTML formatted documents by breaking down the HTML text into understandable parts.

  • The whole XML-revolution would not be possible without programs that can break XML-documents into smaller word-resembling parts.

  • Search engines like Yahoo and AltaVista must isolate and recognize the words you ask them to search for in the millions of documents they access on the Internet.


Designing a Word Extraction Algorithm

To construct an algorithm for extracting words, we must first define what a word is. In this case, I have defined a word to be any of the following three definitions:

  1. A sequence of non-whitespace characters surrounded by whitespace characters on the left and the right. For example, in "The monkey is funny+," "monkey" and "is" can be regarded as words according to this definition, because they are both surrounded by whitespace characters. "The" does not have a whitespace character on its left side, and "funny" lacks whitespace on its left side.

  2. A sequence of non-whitespace characters surrounded by:

    • The beginning of the text on the left

    • Whitespace character on the right

    Consequently, in "The monkey is funny," "The" can be defined as a word.

  3. A sequence of non-whitespace characters surrounded by:

    • A whitespace character on the left

    • The end of the text on the right

    As a result, in "The monkey is funny," "funny" is defined as a word.

By using these definitions, we have included all the sub-strings of a text that we conventionally refer to as words. Equipped with a definition for what our algorithm is looking for, we can begin to form an idea of what it could look like.

The basic strategy utilized in the algorithm presented here employs two variables of type int that are used as text markers. We will call these two variables wordBegin and wordEnd. A number contained in any of these two markers refers to an index of a string and, thereby, an individual character. To illustrate how these markers can access important information in our text, consider the following assignment statement:

 string myText = "My dog is white"; 

Recall that we can view myText as a collection of indexed characters as follows:

 0   1   2   3   4   5   6   7   8   9   10   11   12   13   14 M   y       d   o   g       i   s       w    h    i    t    e 

By assigning suitable values to the two markers wordBegin and wordEnd, we can now

  • Access individual characters with the character access feature (using square brackets [] ) of string, as in the following:

     char ch; wordEnd = 10; ch = myText[wordEnd]; 

    which assigns the letter 'w' to the character variable ch.

  • Access sub-strings enclosed by the two markers by using the SubString method of the string class. wordBegin indicates the beginning of a sub-string; wordEnd indicates the end. For example,

     int wordLength; wordBegin = 3; wordEnd = 6; wordLength = wordEnd   wordBegin; Console.WriteLine(myText.SubString(wordBegin, wordLength); 

    prints the text "dog" to the console.

By making sure that wordBegin always points at the first character of a word, it is possible to let wordEnd move ahead character by character from the position of wordBegin and "look" for the end of the word. This is illustrated in Figure 7.9. According to our previous definitions of a word, wordEnd has found the end of a word whenever any of the two following events take place:

  • wordEnd points at a whitespace character.

  • wordEnd reaches the end of the text.

Figure 7.9. Three important stages when finding the next word in a string.
graphics/07fig09.gif

A newly detected word can then be extracted with the SubString method because, at this particular moment, we know that wordBegin and wordEnd mark the beginning and end of a word.

The last step of one cycle is to move the wordBegin marker to the first character of the next word. This can be achieved by assigning the wordEnd index number plus one to wordBegin. To realize why this makes sense, observe that when wordEnd is detecting a word it is pointing at the space just before the next word. Notice that the latter only holds if the end of the text has not been reached. In that case, however, the algorithm is stopped before an attempt is made to move wordBegin.

Listing 7.11 illustrates with pseudocode the algorithm conceived on the basis of the previous discussion. Notice that it includes the ability to print only detected words that are longer than a certain specified minimum length. This is specified in lines 9 and 14.

Listing 7.11 Pseudocode for WordExtractor Algorithm
01: Initialize the following variables to 0: wordBegin, wordEnd, wordLength. 02: User sets minLenght variable. (the minimum length of a word to be printed) 03: Repeat the following block as long as wordEnd has not reached the end of string. 04: { 05:       move wordEnd one character forward. 06:       if wordEnd is pointing at a whitespace character then execute the following  graphics/ccc.gifblock 07:       { 08:            // A new word has been found enclosed by wordBegin and wordEnd 09:          if new word length is same length or longer than minLength then print the  graphics/ccc.gifword to the console. 10:          Set wordBegin to point at the first character of the next word. 11:       } 12: } 13: wordEnd has reached the end of the string so wordBegin and wordEnd now encloses the  graphics/ccc.giflast word of the string. 14: If the last word is the same length or longer than minLength then print the word on  graphics/ccc.gifthe console. 15: End 

In line 1, setting wordBegin and wordEnd to zero sets both of them to point at the first index of the given string, as illustrated in stage 1 of Figure 7.9.

Line 3 causes the block between the curly brackets of lines 4 and 12 to be repeated as long as wordEnd is not at the end of the string.

Line 5 is the first step of stage 2 in Figure 7.9. wordEnd moves forward character by character. When combined with Lines 3, 4, and 12, line 5 causes wordEnd to point at every single character of the text one by one.

When the condition in line 6 is true, stage 2 of Figure 7.9 has been reached and the execution of the block enclosed by lines 7 and 11 commences.

Line 9 ensures that only words with a minimum length specified by the user in minLength are printed to the console.

Line 10 is related to stage 3 of Figure 7.9. wordBegin jumps forward to point at the beginning of the next word.

Lines 13 16 are self-explanatory.

The Final C# Word Extractor Program

The C# code of WordExtractor.cs can now be written. It is displayed in Listing 7.12.

Listing 7.12 Code of WordExtractor.cs
01: using System; 02: /* 03:  * Locates words in a given text with a minimum 04:  * number of characters and prints those 05:  * on the console. 06:  * The text and the minimum word length is 07:  * provided as input from the user. 08:  */ 09: public class WordExtractor 10: { 11:     public static void Main() 12:     { 13:         string myText; 14:         int minLength = 0; 15:         int wordBegin = 0; 16:         int wordEnd = 0; 17:         int wordLength = 0; 18:         int adjTextLength = 0; 19:         char ch; 20: 21:         Console.WriteLine("Enter text:"); 22:         myText = Console.ReadLine(); 23:         Console.WriteLine("Enter minimum length" + 24:             " of words to be displayed"); 25:         minLength = Convert.ToInt32(Console.ReadLine()); 26:          //Adjust text length to match zero based index 27:         adjTextLength = myText.Length - 1; 28: 29:          //Repeat while-block as long as end of text 30:          //has not been reached 31:         while(wordEnd < adjTextLength) 32:         { 33:             wordEnd++; 34:             ch = myText[wordEnd]; 35:              //If ch is whitespace character then new word 36:              //has been found. wordBegin then indicates 37:              //beginning of word and wordEnd the end of the word 38:             if(char.IsWhiteSpace(ch)) 39:             { 40:                  //Write word on console if number of characters is 41:                  //the same length or longer than minLength 42:                 wordLength = wordEnd - wordBegin; 43:                 if(wordLength >= minLength) 44:                 { 45:                     Console.WriteLine 46:                     (myText.Substring(wordBegin, wordLength)); 47:                 } 48:                  //Jump over whitespace and begin 49:                  //at first character of next word 50:                 wordBegin = wordEnd + 1; 51:             } 52:         } 53:  54:          //Write out last word of myText if number of 55:          //characters is greater than or equal to minLength 56:         wordLength = wordEnd - wordBegin + 1; 57:         if(wordLength >= minLength) 58:         { 59:             Console.WriteLine 60:             (myText.Substring(wordBegin, wordEnd - wordBegin + 1)); 61:         } 62:     } 63: } Enter text: When he who hears does not know what he who speaks means and when he who speaks does not  graphics/ccc.gifknow what he himself means that is philosophy  Voltaire<enter> Enter minimum length of words to be displayed 5<enter> hears speaks means speaks himself means philosophy Voltaire 

The block enclosed by lines 32 and 52 is equivalent to the block enclosed by lines 4 and 12 in Listing 7.11. wordEnd must start at zero (achieved by line 15) and be incremented by one (line 33) until it reaches the last index of myText (achieved by lines 27 and 31). If we mistakenly used (wordEnd < myText.Length) as the condition in line 31, the loop would repeat itself one times too many, because the underlying string index of myText is zero based. (It commences at zero.) Instead we use adjTextLength which is equal to myText.Length - 1.

The character of the current index held by wordEnd is assigned to the ch variable of type char in line 34.

If ch holds a whitespace character when line 38 is executed, a new word has been found and the statements between the curly brackets of lines 39 and 51 are executed.

Lines 43 47 print the detected word (by using the SubString method of the string class in line 46) if its length is the same or longer than specified in minLength.

When the while loop spanning lines 31 52 terminates, wordEnd has reached the end of the string. wordLength of the last word is calculated in line 56. wordEnd does not, at this moment, point at the space after the last word (there is no space) but at the last character of the word. Thus, the word length calculated by wordEnd - wordBegin must be adjusted by adding 1.

If the word length of the last word is longer than or equal to the minimum length specified by minLength, then lines 59-60 print this word to the console.


   


C# Primer Plus
C Primer Plus (5th Edition)
ISBN: 0672326965
EAN: 2147483647
Year: 2000
Pages: 286
Authors: Stephen Prata

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net