Validating Non-Latin User Input | MCAD Developing and Implementing Web Applications with Visual C# .NET and Visual Studio .NET (Exam [... ]am 2)

Another area in which world-ready applications might require code changes is in handling character strings. Two areas in which different alphabets might require you to implement code changes are string indexing and data sorting. These areas require the most coding attention for non-Latin characters (such as Arabic, Hebrew, or Cyrillic characters), but they can be important when dealing with Latin characters as well.

String Indexing

String indexing refers to the process of extracting single characters from a text string. You might think you could simply iterate through the data that makes up the string 16 bits at a time, treating each 16 bits as a separate character. However, things aren't that simple in the Unicode world.

Unicode supports surrogate pairs and combining character sequences. A surrogate pair is a set of two 16-bit codes that together represent a single character from outside the 16-bit range (a code point above U+FFFF). A combining character sequence, on the other hand, is a set of 16-bit codes (a base character followed by one or more combining characters) that together display as a single character. Combining character sequences are often used to combine diacritical marks, such as accents, with base characters.
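The following is a minimal sketch of how both kinds of sequence look from C#. The class name CodeUnitDemo and the sample strings are illustrative assumptions, not part of the original text. It shows that the Length property of a string counts 16-bit codes rather than displayed characters.

```csharp
using System;

public static class CodeUnitDemo
{
    // U+1D504 lies outside the 16-bit range, so it is stored
    // as a surrogate pair of two char values.
    public const string Surrogate = "\uD835\uDD04";

    // "e" followed by U+0301 (combining acute accent) displays
    // as a single accented letter.
    public const string Combining = "e\u0301";

    public static void Main()
    {
        Console.WriteLine(Surrogate.Length);                                // 2
        Console.WriteLine(char.IsSurrogatePair(Surrogate, 0));              // True
        Console.WriteLine(char.ConvertToUtf32(Surrogate, 0).ToString("X")); // 1D504
        Console.WriteLine(Combining.Length);                                // 2
    }
}
```

In both cases a reader sees one character, but code that steps through the string 16 bits at a time sees two.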

This presents a problem: If characters in a string aren't all the same length, how can you move smoothly from one character to the next? The answer is to use the System.Globalization.StringInfo class, which is specially designed for this purpose. The static GetTextElementEnumerator() method of the StringInfo class returns an iterator (a TextElementEnumerator object) you can use to move through the string one character at a time, properly handling surrogate pairs and combining characters. The iterator has a MoveNext() method that returns true when more characters are to be read or false when it has exhausted the characters in the string. The Current property of the iterator returns the character (text element) at the iterator's current position as an object; the GetTextElement() method returns the same element as a string.
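The iteration described above can be sketched as follows. The class name TextElementDemo, the helper GetTextElements(), and the sample string are assumptions made for illustration; only StringInfo.GetTextElementEnumerator(), MoveNext(), and GetTextElement() come from the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Splits a string into text elements (what a reader sees as characters).
    public static List<string> GetTextElements(string text)
    {
        List<string> elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            // GetTextElement() returns the same value as Current, typed as string.
            elements.Add(enumerator.GetTextElement());
        }
        return elements;
    }

    public static void Main()
    {
        // "e" + combining acute accent, a surrogate pair (U+1D504), then "a".
        string sample = "e\u0301\uD835\uDD04a";
        List<string> elements = GetTextElements(sample);
        Console.WriteLine(sample.Length);   // 5 (16-bit codes)
        Console.WriteLine(elements.Count);  // 3 (text elements)
    }
}
```

The sample string holds five 16-bit codes but only three text elements, which is the mismatch that naive 16-bit indexing gets wrong.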

Comparing and Sorting Data

Another area in which you might need to alter code to produce a world-ready application is in comparing and sorting strings. Different cultures use different alphabetical orders to sort strings, and different cultures compare strings differently. For example, the single-character ligature "Æ" is considered to match the two characters AE in some cultures but not in others.

For the most part, you don't have to do any special programming to account for these factors in the .NET Framework. To make your application world ready, you're more likely to need to remove old code (for example, code that assumes that characters are properly sorted if you sort their ASCII character numbers). Specifically, the .NET Framework provides the following culture-aware features:

String.Compare(): By default, this method compares strings according to the rules of the CultureInfo class referenced by the CurrentCulture property.
CultureInfo.CompareInfo: This object can search for substrings according to the comparison rules of its culture.
Array.Sort(): When the array contains strings, this method sorts the members by the alphabetic order rules of the current culture.
SortKey.Compare(): This method compares the sort keys of two strings. The keys are created by CompareInfo.GetSortKey() and encode the sorting rules of that CompareInfo object's culture.
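A brief sketch contrasting the "old code" approach (sorting by character numbers) with the culture-aware features listed above. The class name CultureCompareDemo, its helper methods, and the sample words are hypothetical. Results of the culture-sensitive calls depend on the culture and on the platform's collation data, so no output is claimed for them here.

```csharp
using System;
using System.Globalization;

public static class CultureCompareDemo
{
    // Sorts a copy of the input by raw character numbers (the "old code" approach).
    public static string[] SortOrdinal(string[] words)
    {
        string[] copy = (string[])words.Clone();
        Array.Sort(copy, StringComparer.Ordinal);
        return copy;
    }

    // Sorts a copy of the input using the alphabetic order rules of the given culture.
    public static string[] SortForCulture(string[] words, CultureInfo culture)
    {
        string[] copy = (string[])words.Clone();
        Array.Sort(copy, StringComparer.Create(culture, false));
        return copy;
    }

    public static void Main()
    {
        string[] words = { "apple", "Zebra", "banana" };

        // Uppercase letters have lower character numbers than lowercase ones.
        Console.WriteLine(string.Join(", ", SortOrdinal(words))); // Zebra, apple, banana

        // Culture-aware sort; the order follows the current culture's rules.
        Console.WriteLine(string.Join(", ", SortForCulture(words, CultureInfo.CurrentCulture)));

        // Culture-sensitive comparison and substring search.
        // The results vary by culture and platform.
        CultureInfo danish = new CultureInfo("da-DK");
        Console.WriteLine(string.Compare("\u00C6", "AE", false, danish));
        Console.WriteLine(danish.CompareInfo.IndexOf("encyclop\u00E6dia", "ae"));

        // Ordinal comparison never treats the ligature and the letter pair as equal.
        Console.WriteLine(string.CompareOrdinal("\u00C6", "AE") != 0); // True
    }
}
```

Passing the culture explicitly, rather than relying on CurrentCulture, is a choice made here so that the comparison rules in use are visible in the code.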