String Handling

Chapter 5 - C# and the Base Classes

by Simon Robinson et al.
Wrox Press 2002

Since Chapter 2, we have been using strings almost constantly, and have taken for granted the stated mapping that the string keyword in C# actually refers to the .NET base class System.String. System.String is a very powerful and versatile class, but it is by no means the only string-related class in the .NET armory. In this section, we start off by reviewing the features of System.String, and then go on to look at some quite nifty things you can do with strings using some of the other .NET classes, in particular those in the System.Text and System.Text.RegularExpressions namespaces. We will cover the following areas:

Building Strings: If you're performing repeated modifications on a string, for example in order to build up a lengthy string prior to displaying it or passing it to some other method or software, the String class can be very inefficient. For this kind of situation, another class, System.Text.StringBuilder, is more suitable, since it has been designed for exactly this purpose.
Formatting Expressions: We will also take a closer look at the formatting expressions that we have been using in the Console.WriteLine() method throughout the last few chapters. These formatting expressions are processed using a couple of useful interfaces, IFormatProvider and IFormattable, and by implementing these interfaces on your own classes, you can define your own formatting sequences so that Console.WriteLine() and similar methods will display the values of your classes in whatever way you specify.
Regular Expressions: .NET also offers some very sophisticated classes that deal with the situation in which you need to identify or extract substrings that satisfy certain fairly sophisticated criteria from a long string. By sophisticated, I mean situations such as needing to find all occurrences within a string where a character or set of characters is repeated, or needing to find all words that begin with 's' and contain at least one 'n'. Although you can write methods to perform this kind of processing using the String class, such methods are cumbersome to write. Instead, you can use some classes from System.Text.RegularExpressions, which are designed specifically to perform this kind of processing.

System.String

Before we examine the other string classes, we will quickly review some of the available methods on the String class.

System.String is a class that is specifically designed to store a string, and allow a large number of operations on the string. Not only that, but because of the importance of this data type, C# has its own keyword and associated syntax to make it particularly easy to manipulate strings using this class.

You can concatenate strings using operator overloads:

   string message1 = "Hello";
   message1 += ", There";
   string message2 = message1 + "!";

C# also allows extraction of a particular character using an indexer-like syntax:

   char char4 = message1[4];   // returns 'o'. Note that the index is zero-based

There are also a large number of methods to perform such common tasks as replacing characters, removing whitespace, and capitalization. The available methods include:

Method	Purpose
Compare	Compares the contents of strings, taking into account the culture (locale) in assessing equivalence between certain characters
CompareOrdinal	As Compare, but doesn't take culture into account
Format	Formats a string containing various values and specifiers for how each value should be formatted
IndexOf	Locates the first occurrence of a given substring or character in the string
IndexOfAny	Locates the first occurrence of any one of a set of characters in the string
LastIndexOf	As for IndexOf, but finds the last occurrence
LastIndexOfAny	As for IndexOfAny, but finds the last occurrence
PadLeft	Pads out the string by adding a specified repeated character to the beginning of it
PadRight	Pads out the string by adding a specified repeated character to the end of it
Replace	Replaces occurrences of a given character or substring in the string with another character or substring
Split	Splits the string into an array of substrings, the breaks occurring wherever a given character occurs
Substring	Retrieves the substring starting at a specified position in the string
ToLower	Converts string to lowercase
ToUpper	Converts string to uppercase
Trim	Removes leading and trailing whitespace

Note that this table is not comprehensive, but is intended to give you an idea of the features offered by strings.
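As a quick illustration, the following sketch exercises a few of the methods from the table. The class name StringMethodsDemo and the sample text are our own, chosen purely for demonstration:

```csharp
using System;

public static class StringMethodsDemo
{
    public static void Main()
    {
        string raw = "  apples,pears,plums  ";

        string trimmed = raw.Trim();                 // "apples,pears,plums"
        string[] fruits = trimmed.Split(',');        // array of three substrings
        int firstComma = trimmed.IndexOf(',');       // 6
        int lastComma = trimmed.LastIndexOf(',');    // 12
        string middle = trimmed.Substring(7, 5);     // "pears"
        string padded = "42".PadLeft(6, '0');        // "000042"
        string swapped = trimmed.Replace(',', ';');  // "apples;pears;plums"

        // CompareOrdinal() compares raw character codes, ignoring culture
        bool before = String.CompareOrdinal("apples", "pears") < 0;   // true

        Console.WriteLine(fruits.Length);            // 3
        Console.WriteLine(firstComma + " " + lastComma);
        Console.WriteLine(middle);
        Console.WriteLine(fruits[0].ToUpper());      // APPLES
        Console.WriteLine(padded);
        Console.WriteLine(swapped);
        Console.WriteLine(before);
    }
}
```

Every call here returns a new value and leaves the string it was called on untouched.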

Building Strings

As we have seen, String is an extremely powerful class that implements a large number of very useful methods. However, String has a problem that makes it very inefficient for making repeated modifications to a given string: it is actually an immutable data type, which is to say that once you initialize a string object, that string object can never change. The methods and operators that appear to modify the contents of a string actually create new strings, copying the contents of the old string over if necessary. For example, look at the following code:

   string greetingText = "Hello from all the guys at Wrox Press. ";
   greetingText += "We do hope you enjoy this book as much as we enjoyed writing it.";

What happens when this code executes is this: first, an object of type System.String is created and initialized to hold the text "Hello from all the guys at Wrox Press. " Note the space after the full stop. When this happens, the .NET runtime will allocate just enough memory in the string to hold this text (39 chars), and we set the variable greetingText to refer to this string instance.

In the next line, syntactically it looks like we're adding some more text onto the string, but we are not. Instead, we create a new string instance, with just enough memory allocated to store the combined text, which is 103 characters in total. The original text, "Hello from all the guys at Wrox Press. ", is copied into this new string along with the extra text, "We do hope you enjoy this book as much as we enjoyed writing it." Then, the address stored in the variable greetingText is updated, so the variable correctly points to the new String object. The old String object is now unreferenced (there are no variables that refer to it) and so will be removed the next time the garbage collector comes along.
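The copy-on-modify behavior just described can be observed directly. The sketch below is our own illustration (the class name ImmutabilityDemo and the extra variable original are not part of the chapter's samples); it keeps a second reference to the first string so that we can inspect it after the concatenation:

```csharp
using System;

public static class ImmutabilityDemo
{
    public static void Main()
    {
        string greetingText = "Hello from all the guys at Wrox Press. ";
        string original = greetingText;   // second reference to the same object

        // Looks like an in-place append, but actually creates a new string
        greetingText += "We do hope you enjoy this book as much as we enjoyed writing it.";

        Console.WriteLine(original.Length);       // 39: the first string is unchanged
        Console.WriteLine(greetingText.Length);   // 103: a new, longer string
        Console.WriteLine(object.ReferenceEquals(original, greetingText));   // False
    }
}
```

Because original still refers to the 39-character string here, that string is not yet eligible for garbage collection; in the chapter's version nothing refers to it, so it is. Either way, this one concatenation has produced exactly one extra string.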

By itself, that doesn't look too bad, but suppose we wanted to encode that string by replacing each letter (not the punctuation) with the character whose ASCII code is one further on in the alphabet, as part of some extremely simple encryption scheme. This would turn the string into "Ifmmp gspn bmm uif hvzt bu Xspy Qsftt. Xf ep ipqf zpv fokpz uijt cppl bt nvdi bt xf fokpzfe xsjujoh ju." There are several ways of doing this, but the simplest and (if you are restricting yourself to using the String class) almost certainly the most efficient way is to use the String.Replace() method, which replaces all occurrences of a given substring in a string with another substring. Using Replace(), the code to encode the text would look like this:

   string greetingText = "Hello from all the guys at Wrox Press. ";
   greetingText += "We do hope you enjoy this book as much as we enjoyed writing it.";

   // Work downwards from 'z' so that a letter that has already been
   // shifted is never picked up and shifted a second time
   for (int i = (int)'z'; i >= (int)'a'; i--)
   {
      char oldChar = (char)i;
      char newChar = (char)(i + 1);
      greetingText = greetingText.Replace(oldChar, newChar);
   }

   for (int i = (int)'Z'; i >= (int)'A'; i--)
   {
      char oldChar = (char)i;
      char newChar = (char)(i + 1);
      greetingText = greetingText.Replace(oldChar, newChar);
   }

   Console.WriteLine("Encoded:\n" + greetingText);

For simplicity, this code doesn't wrap Z to A or z to a. These letters get encoded to [ and { respectively.

How much memory do you think we needed to allocate in total to perform this encoding? Replace() works in a fairly intelligent way, to the extent that it won't actually create a new string unless it does actually make some changes to the old string. Our original string contained 23 different lowercase characters and 3 different uppercase ones. Replace() will therefore have allocated a new string 26 times in total, each new string storing 103 characters. That means that as a result of our encryption process there will be string objects capable of storing a combined total of 2,678 characters now sitting on the heap waiting to be garbage-collected! Clearly, if you use strings to do text processing extensively, your applications will run into severe performance problems.
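The arithmetic above is easy to check in code. The following sketch is ours (the class ReplaceCostDemo and the helper CountDistinct are hypothetical names); it counts the distinct letters in the greeting and multiplies by the string length, relying on the behavior described above that Replace() allocates a new string only when it changes something:

```csharp
using System;

public static class ReplaceCostDemo
{
    // Counts how many characters in the range [first, last] occur
    // at least once in the text
    public static int CountDistinct(string text, char first, char last)
    {
        int count = 0;
        for (char c = first; c <= last; c++)
        {
            if (text.IndexOf(c) >= 0)
                count++;
        }
        return count;
    }

    public static void Main()
    {
        string text = "Hello from all the guys at Wrox Press. " +
                      "We do hope you enjoy this book as much as we enjoyed writing it.";

        int lower = CountDistinct(text, 'a', 'z');   // 23
        int upper = CountDistinct(text, 'A', 'Z');   // 3
        int copies = lower + upper;                  // one new string per letter present

        Console.WriteLine(copies);                   // 26
        Console.WriteLine(copies * text.Length);     // 2678
    }
}
```

Counting letters in the original text is valid here because the loops run from 'z' down to 'a': by the time a given letter is processed, every letter already shifted lies above it, so the occurrences being replaced are exactly those of the original string.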

It is in order to address this kind of issue that Microsoft has supplied the System.Text.StringBuilder class. StringBuilder isn't as powerful as String in terms of the number of methods it supports. The processing you can do on a StringBuilder is limited to substitutions and appending or removing text from strings. However, it works in a much more efficient way.

Whereas when you construct a string, just enough memory gets allocated to hold the string, the StringBuilder will normally allocate more memory than needed. You have the option to explicitly indicate how much memory to allocate, but if you don't, then the amount will default to some value that depends on the size of the string that StringBuilder is initialized with. It has two main properties:

Length: The length of the string that it actually contains
Capacity: How long a string it has allocated enough memory to store

Any modifications to the string take place within this block of memory, which makes appending substrings and replacing individual characters within strings very efficient. Removing or inserting substrings is inevitably still inefficient, because it means that the following part of the string has to be moved. Only if you perform some operation that exceeds the capacity of the string will new memory need to be allocated and the entire contained string possibly moved. At the time of writing, Microsoft has not documented how much extra capacity will be added, but from experiments the StringBuilder appears to approximately double its capacity if it detects the capacity has been exceeded and no new value for the capacity has been explicitly set.
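To see the difference between the two properties, we can print them as a builder fills up. This is only a sketch (CapacityDemo is our own name), and since the growth policy is undocumented, the capacity values printed may differ between runtime versions; the only relationship the code relies on is that Capacity is never smaller than Length:

```csharp
using System;
using System.Text;

public static class CapacityDemo
{
    public static void Main()
    {
        StringBuilder sb = new StringBuilder("Hello", 150);
        Console.WriteLine(sb.Length);     // 5
        Console.WriteLine(sb.Capacity);   // the requested capacity (runtime-dependent)

        sb.Append(", There");
        Console.WriteLine(sb.Length);     // 12; fits within the existing block

        // Deliberately start too small, then exceed the capacity
        StringBuilder small = new StringBuilder(4);
        int before = small.Capacity;
        small.Append("This text is longer than four characters");
        Console.WriteLine("Capacity grew from {0} to {1}", before, small.Capacity);
        Console.WriteLine(small.Capacity >= small.Length);   // True
    }
}
```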

As an example, if we use a StringBuilder object to construct our original greeting string, we might write this code:

   StringBuilder greetingBuilder =
      new StringBuilder("Hello from all the guys at Wrox Press. ", 150);
   greetingBuilder.Append("We do hope you enjoy this book as much as we enjoyed " +
      "writing it");

In order to use the StringBuilder class without qualifying its name, you will need a using System.Text; directive in your code.

In this code, we have set an initial capacity of 150 for the StringBuilder. It is always a good idea to set some capacity that covers the likely maximum length of string, to ensure the StringBuilder doesn't need to relocate because its capacity was exceeded. Theoretically, you can set as large a number as it is possible to pass in an int for the capacity, though the system will probably complain that it doesn't have enough memory if you try to actually allocate the maximum of 2 billion characters (this is the theoretical maximum that a StringBuilder instance is in principle allowed to contain).

When the above code is executed, we first create a StringBuilder object whose 150-character block of memory initially holds only the 39 characters of the opening sentence, with the remainder left empty.

Then, on calling the Append() method, the remaining text is placed in the empty space, without needing to allocate any more memory. However, the real efficiency gain from using a StringBuilder comes when we are making repeated text substitutions. For example, if we try to encrypt the text in the same way as before, then we can perform the entire encryption without allocating any more memory whatsoever:

   StringBuilder greetingBuilder =
      new StringBuilder("Hello from all the guys at Wrox Press. ", 150);
   greetingBuilder.Append("We do hope you enjoy this book as much as we enjoyed " +
      "writing it");

   // Replace() modifies the builder in place and returns the same instance
   for (int i = (int)'z'; i >= (int)'a'; i--)
   {
      char oldChar = (char)i;
      char newChar = (char)(i + 1);
      greetingBuilder = greetingBuilder.Replace(oldChar, newChar);
   }

   for (int i = (int)'Z'; i >= (int)'A'; i--)
   {
      char oldChar = (char)i;
      char newChar = (char)(i + 1);
      greetingBuilder = greetingBuilder.Replace(oldChar, newChar);
   }

   Console.WriteLine("Encoded:\n" + greetingBuilder.ToString());

This code uses the StringBuilder.Replace() method, which does the same thing as String.Replace(), but without copying the string in the process. The total memory allocated to hold strings in the above code is 150 characters for the builder, plus the memory allocated during the string operations performed internally in the final Console.WriteLine() statement.

Normally, you will use StringBuilder to perform any manipulation of strings, and String to store or display the final result.

StringBuilder Members

We have demonstrated one constructor of StringBuilder, which takes an initial string and capacity as its parameters. There are also several others. Among them, you can supply only a string:

   StringBuilder sb = new StringBuilder("Hello");

or create an empty StringBuilder with a given capacity:

   StringBuilder sb = new StringBuilder(20);

Apart from the Length and Capacity properties we have mentioned, there is a read-only MaxCapacity property, which indicates the limit to which a given StringBuilder instance is allowed to grow. By default, this is given by int.MaxValue (roughly 2 billion, as noted earlier), but you can set this value to something lower when you construct the StringBuilder object if you wish:

   // This sets the initial capacity to 100, but the max will be 500.
   // Hence, this StringBuilder can never grow to more than 500 characters;
   // it will raise an exception if you try to do that.
   StringBuilder sb = new StringBuilder(100, 500);

   // No constructor takes both an initial string and a maximum capacity,
   // so any starting text has to be appended separately
   sb.Append("Hello");

You can also explicitly set the capacity at any time, though an exception will be raised if you set it to a value less than the current length of the string, or to one that exceeds the maximum capacity:

   StringBuilder sb = new StringBuilder("Hello");
   sb.Capacity = 100;
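The exceptions mentioned above can be triggered deliberately. The sketch below is our own (CapacityLimitsDemo and its two helper methods are hypothetical names); it assumes, in line with the class documentation, that both failures surface as ArgumentOutOfRangeException:

```csharp
using System;
using System.Text;

public static class CapacityLimitsDemo
{
    // Tries to shrink the capacity below the current length
    public static bool ShrinkBelowLengthThrows()
    {
        StringBuilder sb = new StringBuilder("Hello");
        try
        {
            sb.Capacity = 3;   // less than Length, which is 5
            return false;
        }
        catch (ArgumentOutOfRangeException)
        {
            return true;
        }
    }

    // Tries to append more text than MaxCapacity allows
    public static bool GrowPastMaxThrows()
    {
        StringBuilder sb = new StringBuilder(10, 20);   // capacity 10, max 20
        try
        {
            sb.Append(new string('x', 21));
            return false;
        }
        catch (ArgumentOutOfRangeException)
        {
            return true;
        }
    }

    public static void Main()
    {
        Console.WriteLine(new StringBuilder(10, 20).MaxCapacity);   // 20
        Console.WriteLine(ShrinkBelowLengthThrows());               // True
        Console.WriteLine(GrowPastMaxThrows());                     // True
    }
}
```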

The main StringBuilder methods available include:

Method	Purpose
Append()	Appends a string to the current string
AppendFormat()	Appends a string that has been worked out from a format specifier
Insert()	Inserts a substring into the current string
Remove()	Removes characters from the current string
Replace()	Replaces all occurrences of a character by another character or a substring with another substring in the current string
ToString()	Returns the current string cast to a System.String object (overridden from System.Object)

There are several overloads of many of these methods available.
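The following sketch runs through the methods in the table on a single builder, with the contents after each call shown in the comments. BuilderMethodsDemo and Build() are names of our own:

```csharp
using System;
using System.Text;

public static class BuilderMethodsDemo
{
    public static string Build()
    {
        StringBuilder sb = new StringBuilder("Hello", 50);
        sb.Append(" World");                    // "Hello World"
        sb.Insert(5, ",");                      // "Hello, World"
        sb.Replace("World", "There");           // "Hello, There"
        sb.Remove(0, 7);                        // "There"
        sb.AppendFormat(" are {0} items", 3);   // "There are 3 items"
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Build());   // There are 3 items
    }
}
```

Each of these methods returns the builder itself, so the calls could equally be chained into one expression.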

AppendFormat() is actually the method that is ultimately called when you call Console.WriteLine(); it has responsibility for working out what all the format expressions like {0:D} should be replaced with. We will examine this method in the next section.

At the time of writing, there is no cast (either implicit or explicit) from StringBuilder to String. If you want to output the contents of a StringBuilder as a String, the only way to do so is through the ToString() method.

Format Strings

So far, we have written a large number of classes and structs for the code samples presented in this book, and we have normally implemented a ToString() method for each of these in order to be able to quickly display the contents of a given variable. However, quite often there are a number of possible ways that users might want the contents of a variable to be displayed, and often these are culture- or locale-specific. The .NET base class System.DateTime provides the most obvious example of this. Ways that you might want to display the same date include 14 February 2002, 14 Feb 2002, 2/14/02 (in the USA, at least; in the UK, this would be written 14/2/02), or of course in Germany you'd write 14. Februar 2002, and so on.

Similarly, for our Vector struct that we wrote in Chapter 3, we implemented the Vector.ToString() method to display the vector in the format (4, 56, 8). There is, however, another very common way of writing vectors, in which this vector would appear as 4i + 56j + 8k. If we want the classes that we write to be user-friendly, then they need to support the facility to display their string representations in any of the formats that users are likely to want to use. The .NET runtime defines a standard way that this should be done: using an interface, IFormattable. Showing how to add this important feature to your classes and structs is the subject of this section.

The most obvious time that you need to specify the format in which you want a variable displayed is when you call Console.WriteLine(). Therefore, we are going to use this method as an example, although most of our discussion applies to any situation in which you wish to format a string. If, for example, you wish to display the value of a variable in a listbox or textbox, you will normally use the String.Format() method to obtain the appropriate string representation of the variable, but the actual format specifiers you use to request a particular format are identical to those passed to Console.WriteLine(), and as we will see in this section, the same underlying mechanism is used. We start by examining what actually happens when you supply a format string to a primitive type, and from this we will see how we can plug format specifiers for our own classes and structs into the process.

Recall from Chapter 2 that we use format strings in Console.Write() and Console.WriteLine() like this:

   double d = 13.45;
   int i = 45;
   Console.WriteLine("The double is {0,10:E} and the int contains {1}", d, i);

The format string itself consists mostly of the text to be displayed, but wherever there is a variable to be formatted, its index in the parameter list appears in braces. There may be other information inside the braces concerning the format of that item:

The number of characters to be occupied by the representation of the item can appear; this information will be prefixed by a comma. A negative number indicates that the item should be left justified, while a positive number indicates that it should be right justified. If the item actually occupies more characters than have been requested, it will still appear in full.
A format specifier can also appear. This will be preceded by a colon, and indicates how we wish the item to be formatted. For example, do we want a number to be formatted as a currency, or displayed in scientific notation?

We covered the common format specifiers for the numeric types in brief in Chapter 2. Here is the table again for reference:

Specifier	Applies to	Meaning	Example
C	numeric types	locale-specific monetary value	$4834.50 (USA), £4834.50 (UK)
D	integer types only	general integer	4834
E	numeric types	scientific notation	4.834E+003
F	numeric types	fixed point decimal	4384.50
G	numeric types	general number	4384.5
N	numeric types	usual locale-specific format for numbers	4,384.50 (UK/USA), 4 384,50 (continental Europe)
P	numeric types	Percentage notation	432,000.00%
X	integer types only	hexadecimal format	1120 (NB. If you want to display 0x1120, you'd need to write out the 0x separately)

If you want an integer to be padded with zeros, you can use the format specifier 0 (zero) repeated the required number of times. For example, the format specifier 0000 will cause 3 to be displayed as 0003, and 99 to be displayed as 0099, and so on.
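The following sketch tries out the width and specifier parts of a format item. It is our own illustration: the helper Fmt() is a hypothetical name, and we pass CultureInfo.InvariantCulture explicitly so that the results shown in the comments don't depend on the machine's regional settings:

```csharp
using System;
using System.Globalization;

public static class FormatItemsDemo
{
    // Formats a single value using the invariant culture
    public static string Fmt(string format, object value)
    {
        return String.Format(CultureInfo.InvariantCulture, format, value);
    }

    public static void Main()
    {
        Console.WriteLine(Fmt("[{0,8}]", 45));       // [      45]  right justified
        Console.WriteLine(Fmt("[{0,-8}]", 45));      // [45      ]  left justified
        Console.WriteLine(Fmt("[{0,2}]", 12345));    // [12345]     too wide, shown in full
        Console.WriteLine(Fmt("[{0:F2}]", 13.5));    // [13.50]
        Console.WriteLine(Fmt("[{0:X}]", 4384));     // [1120]
        Console.WriteLine(Fmt("[{0:0000}]", 3));     // [0003]
        Console.WriteLine(Fmt("[{0:N}]", 4384.5));   // [4,384.50]
        Console.WriteLine(Fmt("[{0,10:F1}]", 2.5));  // [       2.5] width and specifier combined
    }
}
```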

It is not possible to give a complete list, because other data types can add their own specifiers. Showing how to define our own specifiers for our own classes is the aim of this section.

How the String is Formatted

As an example of how formatting of strings works, we will see what happens when the following statement is executed:

   Console.WriteLine("The double is {0,10:E} and the int contains {1}", d, i);

In fact, Console.WriteLine() just hands the entire set of parameters straight over to the static method String.Format(), the same method that you'd call if you wanted to format these values for use in a string to be used in some other way, such as being displayed in a textbox. With the usual provisos about it being impossible to verify what the actual source code for this method really is, the implementation of the 3-parameter overload of WriteLine() basically does this:

   // Likely implementation of Console.WriteLine()
   public static void WriteLine(string format, object arg0, object arg1)
   {
      Console.WriteLine(string.Format(format, arg0, arg1));
   }

The one-parameter overload of this method, which is in turn getting called above, simply writes out the contents of the string it has been passed, without doing any further formatting on it.

String.Format() now needs to construct the final string by replacing each format specifier by a suitable string representation of the corresponding object. However, as we saw earlier, this kind of process of building up a string is exactly the situation in which we really need a StringBuilder instance rather than a string instance, and that's exactly what happens. For the particular example we are using here, a StringBuilder instance will be created and initialized with the first known portion of the string, the text "The double is ". The StringBuilder.AppendFormat() method will then be called, passing in the first format specifier, {0,10:E}, and the associated object, the double, in order to add the string representation of this object to the string being constructed, and this process will continue with StringBuilder.Append() and StringBuilder.AppendFormat() being called repeatedly until the entire formatted string has been obtained.

Now comes the interesting part, because StringBuilder.AppendFormat() will need to figure out how to actually format the object. The first thing it will do is probe the object to find out whether it implements an interface in the System namespace called IFormattable. You can find this out quite simply by trying to cast an object to this interface and seeing whether the cast succeeds, or by using the C# is keyword. If this test fails, then AppendFormat() will simply call the object's ToString() method, which all objects either inherit from System.Object or override. In the cases of all the classes and structs we have written so far, this is what will happen, since none of the classes we have written so far have implemented this interface. That is why our overrides of Object.ToString() have been sufficient to allow our structs and classes from earlier chapters, such as Vector, to get displayed in Console.WriteLine() statements.

However, all of the predefined primitive numeric types do implement this interface, which means that for those types, and in particular for the double and the int in our example, the basic ToString() method inherited from System.Object will not be called. To understand what happens instead, we need to examine the IFormattable interface.

IFormattable defines just one method, which is also called ToString(). However, this method takes two parameters, as opposed to the System.Object version, which doesn't take any. This is the definition of IFormattable:

   interface IFormattable
   {
      string ToString(string format, IFormatProvider formatProvider);
   }

The first parameter that this overload of ToString() expects is a string that specifies the requested format. In other words, it is the specifier portion of the string that appears in the {} in the string originally passed to Console.WriteLine() or String.Format(). In our example, the original statement was:

   Console.WriteLine("The double is {0,10:E} and the int contains {1}", d, i);

Hence, when evaluating the first specifier, {0,10:E}, this overload will be called against the double variable, d, and the first parameter passed to it will be E. What StringBuilder.AppendFormat() will pass in here is always whatever text appears after the colon in the appropriate format specifier from the original string.

We won't worry about the second parameter to ToString() in this book. It is a reference to an object that implements the interface IFormatProvider. This interface gives further information that ToString() may need to consider when formatting the object, most notably including details of a culture to be assumed (recall that a .NET culture is similar to a Windows locale; if you are formatting currencies or dates then you need this information). If you are calling this ToString() overload directly from your source code, you may wish to supply such an object. However, StringBuilder.AppendFormat() passes in null for this parameter. If formatProvider is null, then ToString() is expected to use the culture specified in the system settings.
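Although we don't pursue format providers further, a brief sketch shows what calling the two-parameter overload directly looks like. This is our own illustration (FormatProviderDemo is a hypothetical name); it uses CultureInfo.InvariantCulture and a hand-built NumberFormatInfo, both of which implement IFormatProvider:

```csharp
using System;
using System.Globalization;

public static class FormatProviderDemo
{
    public static void Main()
    {
        double d = 4384.5;
        IFormattable formattable = d;   // double implements IFormattable

        // Explicit provider: the culture-neutral invariant culture
        Console.WriteLine(formattable.ToString("N", CultureInfo.InvariantCulture));   // 4,384.50

        // Explicit provider: custom separators in the continental European style
        NumberFormatInfo custom = new NumberFormatInfo();
        custom.NumberGroupSeparator = " ";
        custom.NumberDecimalSeparator = ",";
        Console.WriteLine(formattable.ToString("N", custom));                         // 4 384,50

        // null provider: the result depends on the system's culture settings
        Console.WriteLine(formattable.ToString("N", null));
    }
}
```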

Moving back to our example, the first item we wish to format is a double, for which we are requesting exponential notation, with the format specifier E. As just mentioned, the StringBuilder.AppendFormat() method will establish that the double does implement IFormattable, and will therefore call the two-parameter ToString() overload, passing it the string E for the first parameter and null for the second parameter. It is now up to the double's implementation of this method to return the string representation of the double in the appropriate format, taking into account the requested format and the current culture. StringBuilder.AppendFormat() will then sort out padding the returned string with spaces, if necessary, in order to fill the 10 characters the format string specified in this case.

The next object to be formatted is an int, for which we are not requesting any particular format (the format specifier was simply {1}). With no format requested, StringBuilder.AppendFormat() will pass in a null reference for the format string. Again the two-parameter overload of int.ToString() will be expected to respond appropriately. No format has been specifically requested, therefore it will most likely simply call the no-parameter ToString() method.
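The decision just described can be sketched in a few lines. We stress that this is not the real AppendFormat() source, which we cannot see; FormatItem() is a hypothetical helper of our own that mimics the described behavior for a single non-null argument:

```csharp
using System;

public static class DispatchSketch
{
    // Hypothetical stand-in for the per-item work done by AppendFormat().
    // format is the text after the colon (or null); width is the number
    // after the comma (0 if absent). arg is assumed to be non-null.
    public static string FormatItem(object arg, string format, int width)
    {
        string text;
        IFormattable formattable = arg as IFormattable;
        if (formattable != null)
            text = formattable.ToString(format, null);   // two-parameter overload
        else
            text = arg.ToString();                       // plain Object.ToString()

        // Positive width right-justifies, negative width left-justifies
        if (width > 0) return text.PadLeft(width);
        if (width < 0) return text.PadRight(-width);
        return text;
    }

    public static void Main()
    {
        Console.WriteLine("[" + FormatItem(45, null, 6) + "]");     // [    45]
        Console.WriteLine("[" + FormatItem(255, "X", -6) + "]");    // [FF    ]
        // string does not implement IFormattable, so the "X" is ignored
        Console.WriteLine("[" + FormatItem("text", "X", 0) + "]");  // [text]
    }
}
```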

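The whole process can be summarized as follows: Console.WriteLine() passes its parameters to String.Format(), which uses StringBuilder.AppendFormat() to process each format specifier in turn. For each object, AppendFormat() calls the two-parameter ToString(format, formatProvider) overload if the object implements IFormattable, and the no-parameter ToString() method otherwise, then pads the result to the requested width.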

The FormattableVector Example

Now that we have established how format strings are constructed, we are going to extend the Vector example from Chapter 3 so that it implements IFormattable, which means supplying the two-parameter ToString() overload defined by that interface.

The format specifiers we are going to support are:

N: Should be interpreted as a request to supply a quantity known as the Norm of the Vector. This is just the sum of squares of its components, which for mathematics buffs happens to be equal to the square of the length of the Vector, and is usually displayed between double vertical bars, like this: || 34.5 ||.
VE: Should be interpreted as a request to display each component in scientific format, just as the specifier E applied to a double indicates (2.3E+01, 4.5E+02, 1.0E+00).
IJK: Should be interpreted as a request to display the vector in the form 23i + 450j + 1k.
Anything else should simply return the default representation of the Vector, (23, 450, 1.0).

To keep things simple, we are not going to implement any option to display the vector in combined IJK and scientific format. We will, however, make sure we test the specifier in a case-insensitive way, so that we allow ijk instead of IJK. Note that it is entirely up to us which strings we use to indicate the format specifiers.

To achieve this, we first modify the declaration of Vector so it implements IFormattable:

   struct Vector : IFormattable
   {
      public double x, y, z;

Now we add our implementation of the 2-parameter ToString() overload:

   public string ToString(string format, IFormatProvider formatProvider)
   {
      if (format == null)
         return ToString();
      string formatUpper = format.ToUpper();
      switch (formatUpper)
      {
         case "N":
            return "|| " + Norm().ToString() + " ||";
         case "VE":
            return String.Format("({0:E}, {1:E}, {2:E})", x, y, z);
         case "IJK":
            StringBuilder sb = new StringBuilder(x.ToString(), 30);
            sb.Append(" i + ");
            sb.Append(y.ToString());
            sb.Append(" j + ");
            sb.Append(z.ToString());
            sb.Append(" k");
            return sb.ToString();
         default:
            return ToString();
      }
   }

That is all we have to do! Notice how we take the precaution of checking whether format is null before we call any methods against this parameter; we want this method to be as robust as reasonably possible. The format specifiers for all the primitive types are case-insensitive, so that's the behavior that other developers are going to expect from our class too. For the format specifier VE, we need each component to be formatted in scientific notation, so we just use String.Format() again to achieve this. The fields x, y, and z are all doubles. For the case of the IJK format specifier, there are quite a few substrings to be added to the string, so we use a StringBuilder object to improve performance.

For completeness, we will also reproduce the no-parameter ToString() overload that we developed in Chapter 3:

   public override string ToString()
   {
      return "(" + x + " , " + y + " , " + z + ")";
   }

Finally, we need to add a Norm() method that computes the square (norm) of the vector, since we didn't actually supply this method when we first developed the Vector struct in Chapter 3:

   public double Norm()
   {
      return x*x + y*y + z*z;
   }

Now we can try out our formattable vector with some suitable test code:

   static void Main()
   {
      Vector v1 = new Vector(1, 32, 5);
      Vector v2 = new Vector(845.4, 54.3, -7.8);
      Console.WriteLine("\nIn IJK format,\nv1 is {0,30:IJK}\nv2 is {1,30:IJK}",
         v1, v2);
      Console.WriteLine("\nIn default format,\nv1 is {0,30}\nv2 is {1,30}",
         v1, v2);
      Console.WriteLine("\nIn VE format\nv1 is {0,30:VE}\nv2 is {1,30:VE}",
         v1, v2);
      Console.WriteLine("\nNorms are:\nv1 is {0,20:N}\nv2 is {1,20:N}",
         v1, v2);
   }

The result of running this sample is this:

   FormattableVector

   In IJK format,
   v1 is               1 i + 32 j + 5 k
   v2 is      845.4 i + 54.3 j + -7.8 k

   In default format,
   v1 is                   (1 , 32 , 5)
   v2 is          (845.4 , 54.3 , -7.8)

   In VE format
   v1 is (1.000000E+000, 3.200000E+001, 5.000000E+000)
   v2 is (8.454000E+002, 5.430000E+001, -7.800000E+000)

   Norms are:
   v1 is           || 1050 ||
   v2 is      || 717710.49 ||

This shows that our custom specifiers are being picked up correctly.
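The same specifiers work wherever the formatting mechanism is used, not just in Console.WriteLine(). The sketch below is a cut-down variant written purely for illustration: MiniVector is a hypothetical struct of our own that supports only the IJK specifier, and unlike the Vector code above it uses ToUpperInvariant() and the invariant culture so that the results in the comments hold on any machine:

```csharp
using System;
using System.Globalization;

public struct MiniVector : IFormattable
{
    public double x, y, z;

    public MiniVector(double x, double y, double z)
    {
        this.x = x;
        this.y = y;
        this.z = z;
    }

    // Only IJK (in any case) is recognized; anything else gets the default
    public string ToString(string format, IFormatProvider formatProvider)
    {
        if (format != null && format.ToUpperInvariant() == "IJK")
            return String.Format(CultureInfo.InvariantCulture,
                                 "{0} i + {1} j + {2} k", x, y, z);
        return ToString();
    }

    public override string ToString()
    {
        return String.Format(CultureInfo.InvariantCulture,
                             "({0} , {1} , {2})", x, y, z);
    }
}

public static class MiniVectorDemo
{
    public static void Main()
    {
        MiniVector v = new MiniVector(1, 32, 5);
        Console.WriteLine(String.Format("{0:ijk}", v));     // 1 i + 32 j + 5 k
        Console.WriteLine(v.ToString("IJK", null));         // 1 i + 32 j + 5 k
        Console.WriteLine(String.Format("{0:Q}", v));       // (1 , 32 , 5)
        Console.WriteLine(String.Format("[{0,-14}]", v));   // [(1 , 32 , 5)  ]
    }
}
```

An unrecognized specifier such as Q simply falls through to the default representation, which matches the fallback behavior chosen for Vector.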