System.String | Professional C# 2005 with .NET 3.0

Before examining the other string classes, this section quickly reviews some of the available methods on the String class.

System.String is a class specifically designed to store a string and allow a large number of operations on the string. Also, because of the importance of this data type, C# has its own keyword and associated syntax to make it particularly easy to manipulate strings using this class.

You can concatenate strings using operator overloads:

  string message1 = "Hello";          // returns "Hello"
  message1 += ", There";              // returns "Hello, There"
  string message2 = message1 + "!";   // returns "Hello, There!"

C# also allows extraction of a particular character using an indexer-like syntax:

  char char4 = message1[4];   // returns 'o'. Note that the index is zero-based

The String class also provides methods for such common tasks as replacing characters, removing whitespace, and changing capitalization. The following table introduces the key methods.


Method	Purpose
Compare	Compares the contents of strings, taking into account the culture (locale) in assessing equivalence between certain characters
CompareOrdinal	Same as Compare but doesn’t take culture into account
Concat	Combines separate string instances into a single instance
CopyTo	Copies a specified number of characters, starting at a given index in the string, into an existing character array
Format	Formats a string containing various values and specifiers for how each value should be formatted
IndexOf	Locates the first occurrence of a given substring or character in the string
IndexOfAny	Locates the first occurrence of any one of a set of characters in the string
Insert	Inserts a string instance into another string instance at a specified index
Join	Builds a new string by combining an array of strings
LastIndexOf	Same as IndexOf but finds the last occurrence
LastIndexOfAny	Same as IndexOfAny but finds the last occurrence
PadLeft	Pads out the string by adding a specified repeated character to the left side of the string
PadRight	Pads out the string by adding a specified repeated character to the right side of the string
Replace	Replaces occurrences of a given character or substring in the string with another character or substring
Split	Splits the string into an array of substrings, the breaks occurring wherever a given character occurs
Substring	Retrieves the substring starting at a specified position in the string
ToLower	Converts string to lowercase
ToUpper	Converts string to uppercase
Trim	Removes leading and trailing whitespace

Tip

Please note that this table is not comprehensive but is intended to give you an idea of the features offered by strings.
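As a rough illustration of a few of the methods in the table, here is a minimal sketch. The class name StringMethodsDemo and the helper Rejoin() are made up for this example; only the String members themselves come from the framework.

```csharp
using System;

public static class StringMethodsDemo
{
    // Hypothetical helper: splits a comma-separated list and rejoins it
    // with a different separator, using Split() and Join().
    public static string Rejoin(string csv, string separator)
    {
        string[] parts = csv.Split(',');
        return string.Join(separator, parts);
    }

    public static void Main()
    {
        string trimmed = "  Hello, There  ".Trim();

        Console.WriteLine(trimmed);                                   // Hello, There
        Console.WriteLine(trimmed.ToUpper());                         // HELLO, THERE
        Console.WriteLine(trimmed.IndexOf("There",
            StringComparison.Ordinal));                               // 7
        Console.WriteLine(trimmed.Substring(7));                      // There
        Console.WriteLine(trimmed.Replace('e', 'E'));                 // HEllo, ThErE
        Console.WriteLine(trimmed.Insert(5, " again"));               // Hello again, There
        Console.WriteLine("42".PadLeft(6, '0'));                      // 000042
        Console.WriteLine(Rejoin("red,green,blue", " | "));           // red | green | blue

        // CompareOrdinal compares character codes and ignores culture.
        Console.WriteLine(string.CompareOrdinal("apple", "banana") < 0);  // True
    }
}
```

Every call returns a new string; none of them changes the string it is called on.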

Building Strings

As you have seen, String is an extremely powerful class that implements a large number of very useful methods. However, the String class has a shortcoming that makes it very inefficient for making repeated modifications to a given string - it is actually an immutable data type, which means that once you initialize a string object, that string object can never change. The methods and operators that appear to modify the contents of a string actually create new strings, copying across the contents of the old string if necessary. For example, look at the following code:

  string greetingText = "Hello from all the guys at Wrox Press. ";
  greetingText += "We do hope you enjoy this book as much as we enjoyed writing it.";

What happens when this code executes is this: first, an object of type System.String is created and initialized to hold the text Hello from all the guys at Wrox Press. Note the space after the period. When this happens, the .NET runtime allocates just enough memory in the string to hold this text (39 chars), and the variable greetingText is set to refer to this string instance.

In the next line, syntactically it looks like some more text is being added onto the string - though it is not. Instead, a new string instance is created with just enough memory allocated to store the combined text - that’s 103 characters in total. The original text, Hello from all the guys at Wrox Press., is copied into this new string instance along with the extra text, We do hope you enjoy this book as much as we enjoyed writing it. Then, the address stored in the variable greetingText is updated, so the variable correctly points to the new String object. The old String object is now unreferenced - there are no variables that refer to it - and so will be removed the next time the garbage collector comes along to clean out any unused objects in your application.
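One way to observe this behavior is to keep a second reference to the original string and compare it with the variable after the append. The following is a small sketch; the class and method names are invented for the illustration.

```csharp
using System;

public static class ImmutabilityDemo
{
    // Returns true if += produced a new object and left the old one intact.
    public static bool AppendCreatesNewObject()
    {
        string greetingText = "Hello from all the guys at Wrox Press. ";
        string original = greetingText;   // second reference to the same object

        greetingText += "We do hope you enjoy this book as much as we enjoyed writing it.";

        // The first string still holds its 39 characters; greetingText now
        // refers to a different, 103-character object.
        return !object.ReferenceEquals(original, greetingText)
            && original.Length == 39
            && greetingText.Length == 103;
    }

    public static void Main()
    {
        Console.WriteLine(AppendCreatesNewObject());   // True
    }
}
```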

By itself, that doesn’t look too bad, but suppose that you wanted to encode that string by replacing each letter (not the punctuation) with the character whose ASCII code is one further on in the alphabet, as part of some extremely simple encryption scheme. This would change the string to Ifmmp gspn bmm uif hvzt bu Xspy Qsftt. Xf ep ipqf zpv fokpz uijt cppl bt nvdi bt xf fokpzfe xsjujoh ju. Several ways of doing this exist, but the simplest and (if you are restricting yourself to using the String class) almost certainly the most efficient way is to use the String.Replace() method, which replaces all occurrences of a given substring in a string with another substring. Using Replace(), the code to encode the text looks like this:

  string greetingText = "Hello from all the guys at Wrox Press. ";
  greetingText += "We do hope you enjoy this book as much as we enjoyed writing it.";

  for(int i = 'z'; i >= 'a'; i--)
  {
     char old1 = (char)i;
     char new1 = (char)(i+1);
     greetingText = greetingText.Replace(old1, new1);
  }

  for(int i = 'Z'; i >= 'A'; i--)
  {
     char old1 = (char)i;
     char new1 = (char)(i+1);
     greetingText = greetingText.Replace(old1, new1);
  }

  Console.WriteLine("Encoded:\n" + greetingText);

Tip

For simplicity, this code doesn’t wrap Z to A or z to a. These letters get encoded to [ and {, respectively.

Here, the Replace() method works in a fairly intelligent way, to the extent that it won’t actually create a new string unless it actually makes some changes to the old string. The original string contained 23 different lowercase characters and 3 different uppercase ones. The Replace method will therefore have allocated a new string 26 times in total, each new string storing 103 characters. That means that as a result of the encryption process there will be string objects capable of storing a combined total of 2,678 characters now sitting on the heap waiting to be garbage collected! Clearly, if you use strings to do text processing extensively, your applications will run into severe performance problems.
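The figures quoted here can be checked with a short sketch that counts the distinct letters in the greeting and multiplies by the string length. The class ReplaceCostDemo and its CountDistinct() helper are hypothetical names used only for this check.

```csharp
using System;
using System.Collections.Generic;

public static class ReplaceCostDemo
{
    public const string Greeting =
        "Hello from all the guys at Wrox Press. " +
        "We do hope you enjoy this book as much as we enjoyed writing it.";

    // Counts the distinct ASCII letters of one case that occur in text.
    public static int CountDistinct(string text, bool upper)
    {
        HashSet<char> seen = new HashSet<char>();
        foreach (char c in text)
        {
            bool match = upper ? (c >= 'A' && c <= 'Z') : (c >= 'a' && c <= 'z');
            if (match)
            {
                seen.Add(c);
            }
        }
        return seen.Count;
    }

    public static void Main()
    {
        int lower = CountDistinct(Greeting, false);
        int upper = CountDistinct(Greeting, true);

        // Each distinct letter leads to one Replace() call that really changes
        // the string, and therefore to one new copy of the whole string.
        int copies = lower + upper;

        Console.WriteLine("Distinct lowercase: {0}", lower);                 // 23
        Console.WriteLine("Distinct uppercase: {0}", upper);                 // 3
        Console.WriteLine("Characters copied:  {0}", copies * Greeting.Length);  // 2678
    }
}
```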

To address this kind of issue, Microsoft has supplied the System.Text.StringBuilder class. StringBuilder isn’t as powerful as String in terms of the number of methods it supports. The processing you can do on a StringBuilder is limited to substitutions and appending or removing text from strings. However, it works in a much more efficient way.

When you construct a string using the String class, just enough memory is allocated to hold the string. The StringBuilder, however, does better than this and normally allocates more memory than is actually needed. You, as a developer, have the option to indicate how much memory the StringBuilder should allocate, but if you don’t, the amount will default to some value that depends on the size of the string that the StringBuilder instance is initialized with. The StringBuilder class has two main properties:

Length, which indicates the length of the string that it actually contains
Capacity, which indicates the maximum length of the string in the memory allocation

Any modifications to the string take place within the block of memory assigned to the StringBuilder instance, which makes appending substrings and replacing individual characters within strings very efficient. Removing or inserting substrings is inevitably still inefficient, because it means that the following part of the string has to be moved. Only if you perform some operation that exceeds the capacity of the string is it necessary to allocate new memory and possibly move the entire contained string. In adding extra capacity, based on our experiments the StringBuilder appears to double its capacity if it detects the capacity has been exceeded and no new value for the capacity has been set.
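If you want to watch the relationship between Length and Capacity yourself, a sketch along the following lines may help. The names CapacityDemo and Grow() are invented here, and the exact capacities printed depend on the runtime version, so no particular growth pattern is assumed.

```csharp
using System;
using System.Text;

public static class CapacityDemo
{
    // Appends one character at a time and reports every capacity change.
    // Returns false if Capacity is ever observed to be below Length.
    public static bool Grow(int appends)
    {
        StringBuilder sb = new StringBuilder("Hello", 16);
        int lastCapacity = sb.Capacity;
        Console.WriteLine("Length {0}, Capacity {1}", sb.Length, sb.Capacity);

        for (int i = 0; i < appends; i++)
        {
            sb.Append('x');
            if (sb.Capacity < sb.Length)
            {
                return false;
            }
            if (sb.Capacity != lastCapacity)
            {
                lastCapacity = sb.Capacity;
                Console.WriteLine("Length {0}, Capacity {1}", sb.Length, sb.Capacity);
            }
        }
        return true;
    }

    public static void Main()
    {
        Grow(100);
    }
}
```

Most appends leave the capacity untouched; a new line is printed only when the builder has had to allocate more memory.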

For example, if you use a StringBuilder object to construct the original greeting string, you might write this code:

  StringBuilder greetingBuilder =
     new StringBuilder("Hello from all the guys at Wrox Press. ", 150);
  greetingBuilder.AppendFormat("We do hope you enjoy this book as much as we " +
     "enjoyed writing it.");

Tip

To use the StringBuilder class, you need a using System.Text; directive in your code (or you must refer to the class by its full name, System.Text.StringBuilder).

This code sets an initial capacity of 150 for the StringBuilder. It is always a good idea to set some capacity that covers the likely maximum length of a string, to ensure the StringBuilder doesn’t need to relocate because its capacity was exceeded. Theoretically, you can set as large a number as you can pass in an int, although the system will probably complain that it doesn’t have enough memory if you actually try to allocate the maximum of 2 billion characters (this is the theoretical maximum that a StringBuilder instance is in principle allowed to contain).

When the preceding code is executed, it first creates a StringBuilder object that looks like Figure 8-1.

Figure 8-1

Then, on calling the AppendFormat() method, the remaining text is placed in the empty space, without the need for more memory allocation. However, the real efficiency gain from using a StringBuilder comes when you are making repeated text substitutions. For example, if you try to encrypt the text in the same way as before, you can perform the entire encryption without allocating any more memory whatsoever:

  StringBuilder greetingBuilder =
     new StringBuilder("Hello from all the guys at Wrox Press. ", 150);
  greetingBuilder.AppendFormat("We do hope you enjoy this book as much as we " +
     "enjoyed writing it.");

  for(int i = 'z'; i >= 'a'; i--)
  {
     char old1 = (char)i;
     char new1 = (char)(i+1);
     greetingBuilder = greetingBuilder.Replace(old1, new1);
  }

  for(int i = 'Z'; i >= 'A'; i--)
  {
     char old1 = (char)i;
     char new1 = (char)(i+1);
     greetingBuilder = greetingBuilder.Replace(old1, new1);
  }

  Console.WriteLine("Encoded:\n" + greetingBuilder);

This code uses the StringBuilder.Replace() method, which does the same thing as String.Replace(), but without copying the string in the process. The total memory allocated to hold strings in the preceding code is 150 characters for the StringBuilder instance, as well as the memory allocated during the string operations performed internally in the final Console.WriteLine() statement.

Normally, you will want to use StringBuilder to perform any manipulation of strings and String to store or display the final result.

StringBuilder Members

You have seen a demonstration of one constructor of StringBuilder, which takes an initial string and capacity as its parameters. There are others. For example, you can supply only a string:

  StringBuilder sb = new StringBuilder("Hello");

Or you can create an empty StringBuilder with a given capacity:

  StringBuilder sb = new StringBuilder(20);

Apart from the Length and Capacity properties, there is a read-only MaxCapacity property that indicates the limit to which a given StringBuilder instance is allowed to grow. By default, this is given by int.MaxValue (roughly 2 billion, as noted earlier), but you can set this value to something lower when you construct the StringBuilder object:

  // This sets the initial capacity to 100 and the maximum capacity to 500.
  // Hence, this StringBuilder can never grow to more than 500 characters;
  // an exception is raised if you try to exceed that.
  StringBuilder sb = new StringBuilder(100, 500);

You can also explicitly set the capacity at any time, though an exception will be raised if you set it to a value less than the current length of the string or a value that exceeds the maximum capacity:

  StringBuilder sb = new StringBuilder("Hello");
  sb.Capacity = 100;
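The two failure cases can be sketched as follows. The class and method names are made up for this illustration, and the exception type caught (ArgumentOutOfRangeException) is what the framework documentation describes for these cases; treat it as an assumption if you target an unusual runtime.

```csharp
using System;
using System.Text;

public static class CapacityLimitsDemo
{
    // Setting Capacity below the current Length should be rejected.
    public static bool ShrinkBelowLengthThrows()
    {
        StringBuilder sb = new StringBuilder("Hello");
        try
        {
            sb.Capacity = 3;   // "Hello" needs at least 5
            return false;
        }
        catch (ArgumentOutOfRangeException)
        {
            return true;
        }
    }

    // Growing past MaxCapacity should be rejected as well.
    public static bool GrowPastMaxCapacityThrows()
    {
        StringBuilder sb = new StringBuilder(4, 8);   // capacity 4, maximum 8
        sb.Append("12345678");                        // exactly at the maximum
        try
        {
            sb.Append('9');                           // one character too many
            return false;
        }
        catch (ArgumentOutOfRangeException)
        {
            return true;
        }
    }

    public static void Main()
    {
        Console.WriteLine(ShrinkBelowLengthThrows());
        Console.WriteLine(GrowPastMaxCapacityThrows());
    }
}
```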

The following table lists the main StringBuilder methods.


Method	Purpose
Append()	Appends a string to the current string
AppendFormat()	Appends a string that has been worked out from a format specifier
Insert()	Inserts a substring into the current string
Remove()	Removes characters from the current string
Replace()	Replaces all occurrences of a character by another character or a substring with another substring in the current string
ToString()	Returns the current contents as a System.String object (overridden from System.Object)

Several overloads of many of these methods exist.

Tip

AppendFormat() is actually the method that is ultimately called when you call Console.WriteLine(), which has responsibility for working out what all the format expressions like {0:D} should be replaced with. This method is examined in the next section.

There is no cast (either implicit or explicit) from StringBuilder to String. If you want to output the contents of a StringBuilder as a String, you must use the ToString() method.
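The methods in the table can be chained on a single builder, as in this brief sketch. BuilderMethodsDemo and Build() are names chosen for the example only.

```csharp
using System;
using System.Text;

public static class BuilderMethodsDemo
{
    public static string Build()
    {
        StringBuilder sb = new StringBuilder("Hello", 50);

        sb.Append(" World");               // Hello World
        sb.AppendFormat(" #{0:D3}", 7);    // Hello World #007
        sb.Insert(5, ",");                 // Hello, World #007
        sb.Remove(0, 7);                   // World #007
        sb.Replace("World", "There");      // There #007

        // No cast to string exists, so ToString() is required here.
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Build());        // There #007
    }
}
```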

Now that you have been introduced to the StringBuilder class and shown some of the ways in which you can use it to increase performance, you should be aware that this class will not always give you the increased performance that you are looking for. Basically, the StringBuilder class pays off when you are making many modifications to a string, such as repeated appends or replacements inside a loop. However, if you are just doing something as simple as concatenating two strings, you will find that System.String performs better.

Format Strings

So far, a large number of classes and structs have been written for the code samples presented in this book, and they have normally implemented a ToString() method in order to be able to display the contents of a given variable. However, quite often users might want the contents of a variable to be displayed in different, often culture- and locale-dependent, ways. The .NET base class, System.DateTime, provides the most obvious example of this. For example, you might want to display the same date as 10 June 2007, 10 Jun 2007, 6/10/07 (USA), 10/6/07 (UK), or 10.06.2007 (Germany).

Similarly, the Vector struct in Chapter 3, “Objects and Types,” implements the Vector.ToString() method to display the vector in the format (4, 56, 8). There is, however, another very common way of writing vectors, in which this vector would appear as 4i + 56j + 8k. If you want the classes that you write to be user-friendly, they need to support the facility to display their string representations in any of the formats that users are likely to want to use. The .NET runtime defines a standard way that this should be done: the IFormattable interface. Showing how to add this important feature to your classes and structs is the subject of this section.

As you probably know, you need to specify the format in which you want a variable displayed when you call Console.WriteLine(). Therefore, this section uses this method as an example, although most of the discussion applies to any situation in which you want to format a string. For example, if you want to display the value of a variable in a list box or text box, you will normally use the String.Format() method to obtain the appropriate string representation of the variable. However, the actual format specifiers you use to request a particular format are identical to those passed to Console.WriteLine(). Hence, you will focus on Console.WriteLine() as an example. You start by examining what actually happens when you supply a format string to a primitive type, and from this you will see how you can plug in format specifiers for your own classes and structs into the process.

Chapter 2, “C# Basics,” uses format strings in Console.Write() and Console.WriteLine() like this:

  double d = 13.45;
  int i = 45;
  Console.WriteLine("The double is {0,10:E} and the int contains {1}", d, i);

The format string itself consists mostly of the text to be displayed, but wherever there is a variable to be formatted, its index in the parameter list appears in braces. You might also include other information inside the braces concerning the format of that item. For example, you can include:

The number of characters to be occupied by the representation of the item, prefixed by a comma. A negative number indicates that the item should be left-justified, whereas a positive number indicates that it should be right-justified. If the item actually occupies more characters than have been requested, it will still appear in full.
A format specifier, preceded by a colon. This indicates how you want the item to be formatted. For example, you can indicate whether you want a number to be formatted as a currency or displayed in scientific notation.
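The effect of the width value can be seen in a small sketch such as the following. The helper Cell() is hypothetical, and the invariant culture is passed explicitly only so that the output does not depend on the machine’s regional settings.

```csharp
using System;
using System.Globalization;

public static class AlignmentDemo
{
    // Hypothetical helper: formats one value with a culture-independent provider.
    public static string Cell(string format, object value)
    {
        return string.Format(CultureInfo.InvariantCulture, format, value);
    }

    public static void Main()
    {
        Console.WriteLine(Cell("[{0,8:F2}]", 13.45));    // [   13.45]  right-justified
        Console.WriteLine(Cell("[{0,-8:F2}]", 13.45));   // [13.45   ]  left-justified
        Console.WriteLine(Cell("[{0,3:F2}]", 13.45));    // [13.45]     too narrow, shown in full
        Console.WriteLine(Cell("[{0,6}]", 45));          // [    45]    width without a specifier
    }
}
```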

The following table lists the common format specifiers for the numeric types, which were briefly discussed in Chapter 2.


Specifier	Applies To	Meaning	Example
C	Numeric types	Locale-specific monetary value	$4,384.50 (USA) £4,384.50 (UK)
D	Integer types only	General integer	4384
E	Numeric types	Scientific notation	4.384500E+003
F	Numeric types	Fixed-point decimal	4384.50
G	Numeric types	General number	4384.5
N	Numeric types	Common locale-specific format for numbers	4,384.50 (UK/USA) 4 384,50 (continental Europe)
P	Numeric types	Percentage notation	432,000.00%
X	Integer types only	Hexadecimal format	1120 (If you want to display 0x1120, you will have to write out the 0x separately)

If you want an integer to be padded with zeros, you can use the format specifier 0 (zero), repeated once for each digit you require. For example, the format specifier 0000 causes 3 to be displayed as 0003, 99 to be displayed as 0099, and so on.
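A short sketch exercising several of these specifiers follows. The helper Fmt() is an invented name; it passes CultureInfo.InvariantCulture so that the results are the same on every machine, which is also why the culture-dependent C and P specifiers are left out.

```csharp
using System;
using System.Globalization;

public static class NumericSpecifiersDemo
{
    // Hypothetical helper: applies one format specifier using the invariant culture.
    public static string Fmt(IFormattable value, string specifier)
    {
        return value.ToString(specifier, CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        Console.WriteLine(Fmt(4384, "D"));       // 4384
        Console.WriteLine(Fmt(4384.5, "E"));     // 4.384500E+003
        Console.WriteLine(Fmt(4384.5, "F"));     // 4384.50
        Console.WriteLine(Fmt(4384.5, "G"));     // 4384.5
        Console.WriteLine(Fmt(4384.5, "N"));     // 4,384.50
        Console.WriteLine(Fmt(4384, "X"));       // 1120
        Console.WriteLine(Fmt(3, "0000"));       // 0003
        Console.WriteLine(Fmt(99, "0000"));      // 0099
    }
}
```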

It is not possible to give a complete list, because other data types can add their own specifiers. Showing how to define your own specifiers for your own classes is the aim of this section.

How the String Is Formatted

As an example of how strings are formatted, if you execute the following statement:

  Console.WriteLine("The double is {0,10:E} and the int contains {1}", d, i);

Console.WriteLine() just passes the entire set of parameters to the static method, String.Format(). This is the same method that you would call if you wanted to format these values for use in a string to be displayed in a text box, for example. The implementation of the three-parameter overload of WriteLine() basically does this:

  // Likely implementation of Console.WriteLine()
  public static void WriteLine(string format, object arg0, object arg1)
  {
     Console.WriteLine(string.Format(format, arg0, arg1));
  }

The one-parameter overload of this method, which is in turn called in the preceding code sample, simply writes out the contents of the string it has been passed, without doing any further formatting on it.

String.Format() now needs to construct the final string by replacing each format specifier with a suitable string representation of the corresponding object. However, as you saw earlier, for this process of building up a string, you need a StringBuilder instance rather than a string instance. In this example, a StringBuilder instance is created and initialized with the first known portion of the string, the text "The double is ". Next, the StringBuilder.AppendFormat() method is called, passing in the first format specifier, {0,10:E}, as well as the associated object, the double d, in order to add the string representation of this object to the string being constructed. This process continues with StringBuilder.Append() and StringBuilder.AppendFormat() being called repeatedly until the entire formatted string has been obtained.

Now comes the interesting part; StringBuilder.AppendFormat() has to figure out how to format the object. First, it probes the object to find out whether it implements an interface in the System namespace called IFormattable. You can find this out quite simply by trying to cast an object to this interface and seeing whether the cast succeeds, or by using the C# is keyword. If this test fails, AppendFormat() calls the object’s ToString() method, which all objects either inherit from System.Object or override. This is exactly what happens here, because none of the classes written so far has implemented this interface. That is why the overrides of Object.ToString() have been sufficient to allow the structs and classes from earlier chapters such as Vector to get displayed in Console.WriteLine() statements.

However, all of the predefined primitive numeric types do implement this interface, which means that for those types, and in particular for double and int in the example, the basic ToString() method inherited from System.Object will not be called. To understand what happens instead, you need to examine the IFormattable interface.

IFormattable defines just one method, which is also called ToString(). However, this method takes two parameters as opposed to the System.Object version, which doesn’t take any parameters. The following code shows the definition of IFormattable:

  interface IFormattable
  {
     string ToString(string format, IFormatProvider formatProvider);
  }

The first parameter that this overload of ToString() expects is a string that specifies the requested format. In other words, it is the specifier portion of the string that appears inside the braces ({}) in the string originally passed to Console.WriteLine() or String.Format(). In the example, the original statement was:

  Console.WriteLine("The double is {0,10:E} and the int contains {1}", d, i);

Hence, when evaluating the first specifier, {0,10:E}, this overload will be called against the double variable, d, and the first parameter passed to it will be E. StringBuilder.AppendFormat() will pass in here the text that appears after the colon in the appropriate format specifier from the original string.

We won’t worry about the second ToString() parameter in this book. It is a reference to an object that implements the IFormatProvider interface. This interface gives further information that ToString() might need to consider when formatting the object such as culture-specific details (a .NET culture is similar to a Windows locale; if you are formatting currencies or dates, you need this information). If you are calling this ToString() overload directly from your source code, you might want to supply such an object. However, StringBuilder.AppendFormat() passes in null for this parameter. If formatProvider is null, then ToString() is expected to use the culture specified in the system settings.
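Although the second parameter is not pursued further here, a brief sketch shows what supplying one directly can look like. NumberFormatInfo is a framework class that implements IFormatProvider; the class ProviderDemo and the method Continental() are invented for this illustration.

```csharp
using System;
using System.Globalization;

public static class ProviderDemo
{
    // Formats a number with a space as group separator and a comma as
    // decimal separator, as used in parts of continental Europe.
    public static string Continental(double value)
    {
        NumberFormatInfo info = new NumberFormatInfo();
        info.NumberGroupSeparator = " ";
        info.NumberDecimalSeparator = ",";

        // Calls the two-parameter overload defined by IFormattable.
        return value.ToString("N", info);
    }

    public static void Main()
    {
        Console.WriteLine(Continental(4384.5));          // 4 384,50

        // With null, the culture from the system settings is used,
        // so this line can print differently on different machines.
        Console.WriteLine(4384.5.ToString("N", null));
    }
}
```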

Getting back to the example, the first item you want to format is a double, for which you are requesting exponential notation, with the format specifier E. The StringBuilder.AppendFormat() method establishes that the double does implement IFormattable, and will therefore call the two-parameter ToString() overload, passing it the string E for the first parameter and null for the second parameter. It is now up to the double’s implementation of this method to return the string representation of the double in the appropriate format, taking into account the requested format and the current culture. StringBuilder.AppendFormat() will then sort out padding the returned string with spaces, if necessary, to fill the 10 characters the format string specified.

The next object to be formatted is an int, for which you are not requesting any particular format (the format specifier was simply {1}). With no format requested, StringBuilder.AppendFormat() passes in a null reference for the format string. The two-parameter overload of int.ToString() is expected to respond appropriately. No format has been specifically requested; therefore, it will call the no-parameter ToString() method.

This entire string formatting process is summarized in Figure 8-2.

Figure 8-2
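The process just described can be approximated in a few lines. This is only a sketch of the logic as the text explains it, not the framework’s actual source; FormatItem() and PlainPoint are hypothetical names.

```csharp
using System;

// A type that overrides ToString() but does not implement IFormattable.
public class PlainPoint
{
    private readonly int x, y;

    public PlainPoint(int x, int y)
    {
        this.x = x;
        this.y = y;
    }

    public override string ToString()
    {
        return "(" + x + ", " + y + ")";
    }
}

public static class FormatDispatchDemo
{
    // Sketch of how a single {index,width:format} item might be handled.
    public static string FormatItem(object arg, int width, string format,
                                    IFormatProvider provider)
    {
        string text;
        IFormattable formattable = arg as IFormattable;
        if (formattable != null)
        {
            // The object knows how to interpret format specifiers itself.
            text = formattable.ToString(format, provider);
        }
        else
        {
            // Fall back to the parameterless ToString(); the specifier is ignored.
            text = (arg == null) ? "" : arg.ToString();
        }

        // Positive width: right-justify. Negative width: left-justify.
        return width >= 0 ? text.PadLeft(width) : text.PadRight(-width);
    }

    public static void Main()
    {
        Console.WriteLine("[" + FormatItem(45, 6, "D4", null) + "]");                    // [  0045]
        Console.WriteLine("[" + FormatItem(new PlainPoint(1, 2), -8, "D4", null) + "]"); // [(1, 2)  ]
    }
}
```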

The FormattableVector example

Now that you know how format strings are processed, in this section you extend the Vector example from earlier in the book so that you can format vectors in a variety of ways. You can download the code for this example from www.wrox.com. The actual coding is quite simple: all you need to do is implement IFormattable and supply an implementation of the ToString() overload defined by that interface.

The format specifiers you are going to support are:

N - Should be interpreted as a request to supply a quantity known as the Norm of the Vector. This is just the sum of squares of its components, which for mathematics buffs happens to be equal to the square of the length of the Vector, and is usually displayed between double vertical bars, like this: ||34.5||.
VE - Should be interpreted as a request to display each component in scientific format, just as the specifier E applied to a double indicates (2.3E+01, 4.5E+02, 1.0E+00).
IJK - Should be interpreted as a request to display the vector in the form 23i + 450j + 1k.
Anything else should simply return the default representation of the Vector (23, 450, 1.0).

To keep things simple, you are not going to implement any option to display the vector in combined IJK and scientific format. You will, however, make sure you test the specifier in a case-insensitive way, so that you allow ijk instead of IJK. Note that it is entirely up to you which strings you use to indicate the format specifiers.

To achieve this, you first modify the declaration of Vector so it implements IFormattable:

  struct Vector : IFormattable
  {
     public double x, y, z;

     // Beginning part of Vector

Now you add your implementation of the two-parameter ToString() overload:

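  public string ToString(string format, IFormatProvider formatProvider)
  {
     // No specifier requested: use the default representation
     if (format == null)
     {
        return ToString();
     }

     // Compare in upper case so that specifiers are case insensitive;
     // the invariant conversion avoids culture-specific casing surprises
     string formatUpper = format.ToUpperInvariant();
     switch (formatUpper)
     {
        case "N":
           return "|| " + Norm().ToString() + " ||";
        case "VE":
           return String.Format("( {0:E}, {1:E}, {2:E} )", x, y, z);
        case "IJK":
           // Plain text is appended with Append(); AppendFormat() is
           // needed only when the text contains format items
           StringBuilder sb = new StringBuilder(x.ToString(), 30);
           sb.Append(" i + ");
           sb.Append(y.ToString());
           sb.Append(" j + ");
           sb.Append(z.ToString());
           sb.Append(" k");
           return sb.ToString();
        default:
           return ToString();
     }
  }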

That is all you have to do! Notice how you take the precaution of checking whether format is null before you call any methods against this parameter - you want this method to be as robust as reasonably possible. The format specifiers for all the primitive types are case insensitive, so that’s the behavior that other developers are going to expect from your class, too. For the format specifier VE, you need each component to be formatted in scientific notation, so you just use String.Format() again to achieve this. The fields x, y, and z are all doubles. For the case of the IJK format specifier, there are quite a few substrings to be added to the string, so you use a StringBuilder object to improve performance.

For completeness, you also reproduce the no-parameter ToString() overload developed earlier:

  public override string ToString()
  {
     return "( " + x + " , " + y + " , " + z + " )";
  }

Finally, you need to add a Norm() method that computes the square (norm) of the vector, because you didn’t actually supply this method when you developed the Vector struct:

  public double Norm()
  {
     return x*x + y*y + z*z;
  }

Now you can try out your formattable vector with some suitable test code:

  static void Main()
  {
     Vector v1 = new Vector(1,32,5);
     Vector v2 = new Vector(845.4, 54.3, -7.8);
     Console.WriteLine("\nIn IJK format,\nv1 is {0,30:IJK}\nv2 is {1,30:IJK}",
                       v1, v2);
     Console.WriteLine("\nIn default format,\nv1 is {0,30}\nv2 is {1,30}", v1, v2);
     Console.WriteLine("\nIn VE format\nv1 is {0,30:VE}\nv2 is {1,30:VE}", v1, v2);
     Console.WriteLine("\nNorms are:\nv1 is {0,20:N}\nv2 is {1,20:N}", v1, v2);
  }

The result of running this sample is this:

  FormattableVector

  In IJK format,
  v1 is               1 i + 32 j + 5 k
  v2 is      845.4 i + 54.3 j + -7.8 k

  In default format,
  v1 is                 ( 1 , 32 , 5 )
  v2 is        ( 845.4 , 54.3 , -7.8 )

  In VE format
  v1 is ( 1.000000E+000, 3.200000E+001, 5.000000E+000 )
  v2 is ( 8.454000E+002, 5.430000E+001, -7.800000E+000 )

  Norms are:
  v1 is           || 1050 ||
  v2 is      || 717710.49 ||

This shows that your custom specifiers are being picked up correctly.
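For readers who want to try the pieces together without the download, here is a compact, self-contained assembly of them. It is a sketch rather than the book’s exact listing: the three-argument constructor is an assumption (the test code calls one, but this section does not show it), and the specifier is upper-cased with ToUpperInvariant() so that the comparison does not depend on the current culture.

```csharp
using System;
using System.Text;

public struct Vector : IFormattable
{
    public double x, y, z;

    // Assumed constructor; the test code uses it but it is not shown in this section.
    public Vector(double x, double y, double z)
    {
        this.x = x;
        this.y = y;
        this.z = z;
    }

    // Sum of the squares of the components.
    public double Norm()
    {
        return x * x + y * y + z * z;
    }

    public override string ToString()
    {
        return "( " + x + " , " + y + " , " + z + " )";
    }

    public string ToString(string format, IFormatProvider formatProvider)
    {
        if (format == null)
        {
            return ToString();
        }
        switch (format.ToUpperInvariant())
        {
            case "N":
                return "|| " + Norm() + " ||";
            case "VE":
                return String.Format("( {0:E}, {1:E}, {2:E} )", x, y, z);
            case "IJK":
                StringBuilder sb = new StringBuilder(x.ToString(), 30);
                sb.Append(" i + ").Append(y).Append(" j + ").Append(z).Append(" k");
                return sb.ToString();
            default:
                return ToString();
        }
    }
}

public static class FormattableVectorDemo
{
    public static void Main()
    {
        Vector v1 = new Vector(1, 32, 5);
        Console.WriteLine("{0:ijk}", v1);              // 1 i + 32 j + 5 k
        Console.WriteLine("{0:N}", v1);                // || 1050 ||
        Console.WriteLine("{0}", v1);                  // ( 1 , 32 , 5 )
        Console.WriteLine(v1.ToString("IJK", null));   // 1 i + 32 j + 5 k
    }
}
```

The lowercase ijk in the first call confirms that the specifier is matched without regard to case.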