11.3. The Preparedness Principle

Well-written program code is prepared to handle any input data, even data that should not occur. Handling may of course consist of simply detecting an error and, for example, skipping erroneous data silently, skipping it with a warning message, or reporting an error and terminating. In writing a subroutine that will not be called from outside our program, we might consider relying on the caller to pass correct data only, to save both programming and execution time. When writing library routines, especially if they perform complex tasks, the programmer should normally check all input data and expect that, for example, a parameter of string type may contain anything at all and be of any length.

Processing of character data needs to be efficient, too, if the amount of data is large or processing takes place very often. In most applications, the expected character data is from a small repertoire. When processing data that represents people's answers to questions like "How many...?", we should quickly process an answer that consists of common digits. Whether anything else is treated as an error is a different matter. You might decide to accept other digits too, or even some verbal expressions.

11.3.1. Being Prepared for Amount of Data

In particular, in program code to be invoked by other programs or directly by users in an open environment (e.g., CGI scripts on the Web), checking all data is crucial. The software should expect literally anything, such as a gigabyte of junk sent by a confused or malevolent user. Many attempts at breaking into systems or at making them execute code written by a cracker are based on assumed unpreparedness. Typically, a cracker sends special data that is expected to cause a buffer overflow, i.e., to make a program store a string larger than the buffer area allocated for it. The overflow may cause the attacker's data to overwrite the program's code, so that the attacker's data is executed next.

In a form on a web page, even if you use an attribute that is expected to limit the amount of data, it can be overruled. Your form might contain <input name="foo" maxlength="80" size="50">, setting the visible width of a text input field to (about) 50 characters and the maximum amount of data to 80 characters. However, anyone could copy the page, edit the form, and modify or remove the restriction, just to do some experimentation or customization or to break your form handler. This could mean sending data where the field is millions of characters long.

Thus, the classical advice on handling strings in Henry Spencer's Ten Commandments for C Programmers is particularly important in the modern world:

Thou shalt check the array bounds of all strings (indeed, all arrays), for surely where thou typest "foo" someone someday shall type "supercalifragilisticexpialidocious."

Of course, you might not use arrays to implement strings in the programming language you use, but the principle is the same: check the lengths of strings.
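The length check can be sketched in Java as follows. This is only a minimal illustration: the class name FieldCheck, the method name, and the limit of 80 (borrowed from the form example above) are assumptions made for this sketch, not part of any particular framework.

```java
import java.util.Arrays;

// Hypothetical helper; the names and the limit of 80 are illustrative assumptions.
final class FieldCheck {
    static final int MAX_FIELD_CODE_POINTS = 80;

    /** Returns true only if the value is present and within the given limit. */
    static boolean hasAcceptableLength(String value, int maxCodePoints) {
        if (value == null) {
            return false;
        }
        // Cheap early rejection: a code point takes at most two UTF-16 units,
        // so a longer string cannot fit and need not be scanned at all.
        if (value.length() > 2L * maxCodePoints) {
            return false;
        }
        // Count code points rather than UTF-16 units, so that a character
        // outside the BMP counts as one character.
        return value.codePointCount(0, value.length()) <= maxCodePoints;
    }

    public static void main(String[] args) {
        char[] junk = new char[1_000_000];
        Arrays.fill(junk, 'x');
        System.out.println(hasAcceptableLength("foo", MAX_FIELD_CODE_POINTS));
        System.out.println(hasAcceptableLength(new String(junk), MAX_FIELD_CODE_POINTS));
    }
}
```

The check is performed by the receiving program, so it holds no matter what the form that submitted the data looked like. Whether an overlong value is rejected, truncated, or reported is an application decision.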

11.3.2. Being Prepared for Content of Data

In the modern world, we also need to be prepared for any content in strings. Someone someday will type %46\efβ↨♫ or something weirder. There are two basic aspects:

A string may contain characters that have special effects in the program. For example, a program might contain a search operation controlled by a string supplied as data. This may involve security threats by allowing intruders to execute their own code or make the program crash. For examples of what this might mean in the Perl language, and for measures against it, see the Perlmeme.org HOWTO entry "How do I use taint mode," http://perlmeme.org/howtos/secure_code/taint.html.
Some characters may confuse the data processing in a program because there is no programmed handling for them. Of course, most programs are meant to handle only a small repertoire of characters in a useful way. A program should, however, at least skip characters that it does not know.

11.3.2.1. Methods of handling unexpected characters

When a program encounters a character (or a code point) that it is not prepared to handle normally, it can perform one or several of the following actions. The choice depends, among other things, on the application, its interactivity, and the type of the character.

Pass it through: The character could be treated as an unknown character, which is just passing by. Even though the program does not "know" the character, it would store it as part of a string and save it or pass it forward to any other program.
Skip (ignore) it: This means behaving as if the character were not present in the input. The character is removed when storing input data into a program variable or data structure. This can be adequate for characters that are expected to result from conversions, other technical transformations, and software tools used to create a file. For example, data often contains NUL (U+0000) characters for such reasons, and normally NUL has no meaning in input data. Skipping any Unicode character that a program is not designed to handle is a feasible strategy in some situations.
Warn about it: A program might issue a warning about a character that it cannot handle meaningfully, especially if the character is not expected to appear. The warning might be formulated as an error message, too. The warning should normally identify the character by its Unicode number, e.g., "Unrecognized character (U+1234) detected at line 42 ignored." The number is probably useless to an end user, but it helps a professional who has been asked to help with the problem. It might be more user-friendly to issue a message that indicates the type of the character (such as "Unrecognized letter (U+1234) ..."), as determined, for example, from its General Category property value using a suitable library function.
Map it to something else: A program could treat a character as corresponding to another character, which it can handle properly. This is often user-friendly, but it is also risky. For example, a program that does not handle accented letters could treat them as equivalent to the corresponding unaccented letter. If your database stores strings in ASCII format, you could still allow accents in user input, so that searching for "Rhône" would find an entry about "Rhone." When character data is to be stored, you should probably warn the user about the mapping.
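Three of the actions listed above (skipping, warning, and mapping) might be sketched in Java as follows. The class and method names are hypothetical. The mapping uses canonical decomposition followed by removal of combining marks, which is one possible technique among several and does not cover letters that have no decomposition (such as "ø").

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper class; all names are illustrative assumptions.
final class UnexpectedChars {
    /** Skip: behaves as if NUL (U+0000) characters were not present in the input. */
    static String skipNul(String input) {
        return input.replace("\0", "");
    }

    /** Warn: identifies the character by its Unicode number. */
    static String warning(int codePoint, int line) {
        return String.format(Locale.ROOT,
            "Unrecognized character (U+%04X) detected at line %d ignored.",
            codePoint, line);
    }

    /** Map: decomposes accented letters and drops the combining marks. */
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(skipNul("a\0b"));
        System.out.println(warning(0x1234, 42));
        System.out.println(stripAccents("Rh\u00f4ne"));
    }
}
```

With such a mapping applied to both the stored keys and the search string, a search for "Rhône" would match an entry stored as "Rhone".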

11.3.2.2. Displaying unrecognized or undisplayable code points

A program may need to handle unrecognized characters on display. Any software that renders character data should be somehow prepared for the unexpected. Even if you have some planned processing for any defined Unicode character, the data might contain an unassigned code point, a private use code point, or a noncharacter. Unassigned code points might be assigned later, so handling them means being aware of new versions of Unicode.

When an output routine receives a character that it does not understand, it is usually too late to report an error. Errors should be handled at a higher level in the program logic, and the output routine should expect that this has been done. The Unicode standard mentions, descriptively, the following methods of rendering unassigned code points and private use code points (assuming, of course, that the application does not assign a meaning to such code points):

Display the code number in four to six hexadecimal digits
Display a black or white box
Display a generic, character-like symbol, possibly using different symbols to denote unassigned code points and private use code points
Display nothing; this is recommended for a collection of code points known as default ignorable code points.

In practice, programs often use the question mark ?, too. This, as well as displaying the code number as such, is a problem because it cannot always be distinguished from the display of actual data. If possible, use some special formatting (say, a different color) to indicate that something special has happened. Displaying the code number can be informative to people who know character codes but confusing to others. In any case, it might be a good idea to use some delimiters, such as <E000> or {E000} instead of just E000. If possible, use delimiters that do not normally appear as data characters.

Similar considerations apply to characters that a program recognizes but cannot display, typically due to font restrictions. The standard suggests that the program could display a glyph that reflects the type of the character, as derived from its known properties.
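As a rough sketch of the delimited code number notation, the following Java code replaces each code point that cannot be displayed with its number in angle brackets. The canDisplay method is a hypothetical stand-in that accepts printable ASCII only; a real program would ask its rendering environment instead.

```java
// Hypothetical sketch; class and method names are illustrative assumptions.
final class FallbackDisplay {
    /** Stand-in for "the output device can show this": printable ASCII only. */
    static boolean canDisplay(int codePoint) {
        return codePoint >= 0x20 && codePoint <= 0x7E;
    }

    /** Renders undisplayable code points as <XXXX>, using four to six hex digits. */
    static String render(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (canDisplay(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(String.format("<%04X>", cp));
            }
            i += Character.charCount(cp);   // advance by one or two UTF-16 units
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("a\uE000b"));   // private use code point U+E000
    }
}
```

The sketch does not address the ambiguity mentioned above: if the data itself may contain "<" and ">", other delimiters or special formatting would be needed.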

11.3.2.3. Default ignorable code points

The Unicode standard defines some characters as ignorable in display by default, i.e., to be ignored on output if they are not supported in a constructive manner. These characters have no visible glyph or advance width, but when adequately implemented, they may affect the display, positioning, or adornment of adjacent or surrounding characters. The idea is that if a program does not know how to do so, it should not display anything for the character, not even a symbol for a missing character.

Default ignorable code points are described by the Default Ignorable Code Point (DI) property, defined in the DerivedCoreProperties.txt file of the Unicode database. It is a derived property and covers the following:

Code points with a General Category value of Cf (Other, format), Cs (Other, surrogate), or Cc (Other, control), except whitespace characters (e.g., TAB) and interlinear annotation characters U+FFF9..U+FFFB
Noncharacter code points
A set of other characters, defined by the Other Default Ignorable Code Point (ODI) property (in the file PropList.txt); currently, the set contains the combining grapheme joiner U+034F, some Hangul filler characters, and some reserved code points

Default ignorable code points include the soft hyphen U+00AD, the word joiner U+2060, and the left-to-right mark U+200E and the right-to-left mark U+200F (which all have General Category = Cf). Thus, if a program does not support the functionality expressed with some of these characters, it should completely ignore the character on display.

It is permissible for a program to present default ignorable code points in special circumstances, even when it does not implement them as defined. In particular, word processors and layout design programs often have a display mode where invisible formatting characters are shown in some special way.
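The derivation listed above can be approximated in Java along the following lines. This is only a sketch under stated assumptions: the class and method names are hypothetical, the set of "other" default ignorable characters is reduced to the combining grapheme joiner alone, and the whitespace exception is hard-coded for the control characters concerned. The Unicode database may define the property differently in other versions, so production code should rely on DerivedCoreProperties.txt or on a library that reads it.

```java
// Hypothetical approximation of the derivation described in the text.
final class DefaultIgnorable {
    /** Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane. */
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    /** Control characters that have the White_Space property (TAB..CR and NEL). */
    static boolean isWhitespaceControl(int cp) {
        return (cp >= 0x0009 && cp <= 0x000D) || cp == 0x0085;
    }

    static boolean isDefaultIgnorableApprox(int cp) {
        if (cp >= 0xFFF9 && cp <= 0xFFFB) {
            return false;               // interlinear annotation characters
        }
        if (isWhitespaceControl(cp)) {
            return false;               // e.g., TAB
        }
        int type = Character.getType(cp);
        if (type == Character.FORMAT          /* Cf */
                || type == Character.SURROGATE /* Cs */
                || type == Character.CONTROL)  /* Cc */ {
            return true;
        }
        if (isNoncharacter(cp)) {
            return true;
        }
        return cp == 0x034F;            // combining grapheme joiner (incomplete ODI set)
    }

    public static void main(String[] args) {
        System.out.println(isDefaultIgnorableApprox(0x00AD));  // soft hyphen
        System.out.println(isDefaultIgnorableApprox('A'));
    }
}
```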

11.3.3. Table-Driven Versus Property-Driven Processing

In old-style programs that are meant to read ASCII data only, there are only 128 possible input values. In practice, the program actually reads 8 bits, so it should check that the most significant bit is zero and do something special if it is not. The processing of any normal data, however, can start with a simple branching that uses, for example, a case or switch statement or something similar (depending on language). It is feasible to handle all the possible 128 cases. Alternatively, you could use a table-driven approach that uses a 128-element table to map an input character to something manageable, such as an indicator of its class, according to an application-dependent classification.
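A table-driven classifier for ASCII input might look like the following Java sketch. The classification itself (digit, letter, space, punctuation, other) is an arbitrary example of an application-dependent classification, and all names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of a 128-element classification table.
final class AsciiTable {
    enum CharClass { DIGIT, LETTER, SPACE, PUNCT, OTHER, NON_ASCII }

    private static final CharClass[] TABLE = new CharClass[128];

    static {
        Arrays.fill(TABLE, CharClass.OTHER);
        for (char c = '0'; c <= '9'; c++) TABLE[c] = CharClass.DIGIT;
        for (char c = 'A'; c <= 'Z'; c++) TABLE[c] = CharClass.LETTER;
        for (char c = 'a'; c <= 'z'; c++) TABLE[c] = CharClass.LETTER;
        for (char c : " \t\n\r".toCharArray()) TABLE[c] = CharClass.SPACE;
        for (char c : ".,;:!?-".toCharArray()) TABLE[c] = CharClass.PUNCT;
    }

    /** Looks up the class; values outside 0..127 are flagged, not indexed. */
    static CharClass classify(int value) {
        if (value < 0 || value > 0x7F) {
            return CharClass.NON_ASCII;   // "do something special"
        }
        return TABLE[value];
    }

    public static void main(String[] args) {
        for (char c : "x7 @".toCharArray()) {
            System.out.println((int) c + " -> " + classify(c));
        }
        System.out.println(0xF4 + " -> " + classify(0xF4));
    }
}
```

The range test before the lookup is the important part: without it, an input value with the high bit set would index outside the table.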

In the simplest cases, a program can just test for an input character being "interesting" in the context of the application and skip all other characters. For example, when reading numeric data, a program could recognize just digits and a few other characters like "." and "-" and ignore the rest. However, it is usually much better to report unexpected characters as errors or at least warnings.

When 8-bit character codes are used, similar simple approaches can still be used. A 256-element decision table (or branching construct) is usually not excessively large. When a program reads Unicode data, the situation changes. Even if we consider only BMP characters, there would be tens of thousands of entries to consider. Although modern computers can store and use large tables, the programming work would be excessive.

The Unicode properties of characters are, in part, meant to be used to make program logic simpler and programs smaller. You could, for example, first use the General Category property value for the initial branching. You could even group these values by their initial letter: letter (L), mark (M), number (N), separator (Z), punctuation (P), symbol (S), and other (C).
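Grouping by the initial letter of the General Category value can be sketched compactly in Java. The class name MajorCategory and the method name are assumptions for this illustration; the constants and Character.getType belong to the standard Java library.

```java
// Hypothetical helper that reduces the General Category to its initial letter.
final class MajorCategory {
    static char of(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER:
            case Character.LOWERCASE_LETTER:
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
                return 'L';
            case Character.NON_SPACING_MARK:
            case Character.COMBINING_SPACING_MARK:
            case Character.ENCLOSING_MARK:
                return 'M';
            case Character.DECIMAL_DIGIT_NUMBER:
            case Character.LETTER_NUMBER:
            case Character.OTHER_NUMBER:
                return 'N';
            case Character.SPACE_SEPARATOR:
            case Character.LINE_SEPARATOR:
            case Character.PARAGRAPH_SEPARATOR:
                return 'Z';
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return 'P';
            case Character.MATH_SYMBOL:
            case Character.CURRENCY_SYMBOL:
            case Character.MODIFIER_SYMBOL:
            case Character.OTHER_SYMBOL:
                return 'S';
            default:
                return 'C';   // Cc, Cf, Cs, Co, Cn
        }
    }

    public static void main(String[] args) {
        int[] samples = { 'A', 0x0301, '5', 0x00A0, '!', '+', 0xE000 };
        for (int cp : samples) {
            System.out.println(String.format("U+%04X -> %c", cp, of(cp)));
        }
    }
}
```

A program can branch on these seven values first and look at the full two-letter value only where the distinction matters.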

The following rather simple program illustrates several principles described in this chapter. It is meant to work in an environment in which character display is limited to ASCII. It processes a Unicode string and presents it so that ASCII characters are displayed as such whereas other characters are shown using special notations like "[L:f4]," where "L" indicates the character as a letter and "f4" is its code number in hexadecimal. Such presentation might be useful to a knowledgeable person who needs to inspect the content of a Unicode file that mostly consists of ASCII characters. The program branches according to the General Category (gc) value of the character, as obtained using the getType function; the gc values as defined in the Unicode standard are given in comments:

public class Show {
    /* Prints a bracketed notation such as [L:f4] for a non-ASCII character. */
    public static void printc(String symbol, int data) {
        System.out.print("[" + symbol + Integer.toHexString(data) + "]");
    }

    public static void main(String[] args) {
        String msg = "Rhône, 42\u00a0§, price £50";
        for (int i = 0; i < msg.length(); i++) {
            char ch = msg.charAt(i);
            int code = ch;
            if (code <= 0x7F) {   /* ASCII */
                System.out.print(ch);
            } else switch (Character.getType(ch)) {
            case Character.UPPERCASE_LETTER:          /* Lu */
            case Character.LOWERCASE_LETTER:          /* Ll */
            case Character.TITLECASE_LETTER:          /* Lt */
            case Character.MODIFIER_LETTER:           /* Lm */
            case Character.OTHER_LETTER:              /* Lo */
                printc("L:", code);
                break;
            case Character.DECIMAL_DIGIT_NUMBER:      /* Nd */
            case Character.LETTER_NUMBER:             /* Nl */
            case Character.OTHER_NUMBER:              /* No */
                printc("N:", code);
                break;
            case Character.NON_SPACING_MARK:          /* Mn */
            case Character.COMBINING_SPACING_MARK:    /* Mc */
            case Character.ENCLOSING_MARK:            /* Me */
                printc("~:", code);
                break;
            case Character.SPACE_SEPARATOR:           /* Zs */
                printc(" :", code);
                break;
            case Character.LINE_SEPARATOR:            /* Zl */
                System.out.println();
                break;
            case Character.PARAGRAPH_SEPARATOR:       /* Zp */
                System.out.println();
                System.out.println();
                break;
            case Character.CONTROL:                   /* Cc */
            case Character.FORMAT:                    /* Cf */
            case Character.SURROGATE:                 /* Cs */
            case Character.UNASSIGNED:                /* Cn */
                if (Character.isWhitespace(ch)) {
                    System.out.print(ch);
                } else if (code >= 0xFFF9 && code <= 0xFFFB) {
                    printc("A:", code);
                }
                /* Otherwise: default ignorable, no display */
                break;
            case Character.PRIVATE_USE:               /* Co */
                printc("P:", code);
                break;
            case Character.CONNECTOR_PUNCTUATION:     /* Pc */
                printc("_:", code);
                break;
            case Character.DASH_PUNCTUATION:          /* Pd */
                printc("-:", code);
                break;
            case Character.START_PUNCTUATION:         /* Ps */
                printc("(:", code);
                break;
            case Character.END_PUNCTUATION:           /* Pe */
                printc("):", code);
                break;
            case Character.INITIAL_QUOTE_PUNCTUATION: /* Pi */
                System.out.print("[quote]");
                break;
            case Character.FINAL_QUOTE_PUNCTUATION:   /* Pf */
                System.out.print("[unquote]");
                break;
            case Character.OTHER_PUNCTUATION:         /* Po */
                printc("!:", code);
                break;
            case Character.MATH_SYMBOL:               /* Sm */
                printc("+:", code);
                break;
            case Character.CURRENCY_SYMBOL:           /* Sc */
                printc("$:", code);
                break;
            case Character.MODIFIER_SYMBOL:           /* Sk */
                printc("^:", code);
                break;
            case Character.OTHER_SYMBOL:              /* So */
                printc("S:", code);
                break;
            default:
                printc("??:", code);
                break;
            }
        }
        System.out.println();
    }
}

The program outputs:

Rh[L:f4]ne, 42[ :a0][S:a7], price [$:a3]50

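Some old Java implementations classify characters with gc values Pi or Pf as if the values were Ps or Pe, respectively. Therefore, they are unable to recognize the predefined names INITIAL_QUOTE_PUNCTUATION and FINAL_QUOTE_PUNCTUATION. Conversely, Java versions that follow later versions of Unicode, in which the section sign § has the gc value Po rather than So, print [!:a7] instead of [S:a7].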

11.3.4. Naïve Processing

In old programs, character data is often processed in a naïve manner that assumes a particular character code, typically ASCII. You might even see code like ch == 32, which tests for a character being a space, using the ASCII code, instead of the more natural and more portable ch == ' '.

Suppose that the variable ch contains a single character and we wish to test whether the value is a letter. The following style (exemplified here using the C language notation) is often used in old software:

if( ((ch >= 'A') && (ch <= 'Z')) ||  ((ch >= 'a') && (ch <= 'z')) ) ...

Here, && means "and" and || means "or," and the expression operates on comparisons that test whether the character's code number is between the code numbers of "A" and "Z" or between the code numbers of "a" and "z." Generally, in programming languages, comparisons of character values operate on the code numbers of characters.

If the data contains only basic Latin letters, the naïve approach works in most cases. The reason is that in most character codes, those letters are in alphabetic order and consecutive, i.e., there is nothing but letters between "A" and "Z" or between "a" and "z" in the code. However, the assumption is not correct for the EBCDIC code, as described in Chapter 3.

A more serious problem is that the approach fails for letters with diacritic marks, or for other than basic Latin letters in general. It would be awkward to write code that compares a character value against all the possible letters that might appear in Unicode data. A modern approach, which has been good style for a long time, is to use subprogram (function) calls that test such things. For example, in C, using the standard character classification functions that you may refer to by using #include <ctype.h> in your program, you can write as follows:

if(isalpha(ch)) ...

This is both simpler and more robust. However, it makes the program depend on the definition of the isalpha function, which can be locale-dependent. This can be a problem or an asset (see the section "Using Locales" later in this chapter).
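The difference between the naïve range test and a library test can be seen in the following Java sketch. The class name and the isBasicLatinLetter method are hypothetical; Character.isLetter is a standard Java method, and it is based on the Unicode General Category of the character rather than on the locale.

```java
// Hypothetical comparison of the naive range test with a property-based test.
final class LetterTest {
    /** The naive test: true only for "A".."Z" and "a".."z". */
    static boolean isBasicLatinLetter(int cp) {
        return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z');
    }

    public static void main(String[] args) {
        // "R", "ô" (U+00F4), Greek beta (U+03B2), and the digit "4"
        int[] samples = { 'R', 0x00F4, 0x03B2, '4' };
        for (int cp : samples) {
            System.out.println(String.format("U+%04X naive=%b isLetter=%b",
                cp, isBasicLatinLetter(cp), Character.isLetter(cp)));
        }
    }
}
```

For U+00F4 and U+03B2 the naïve test answers false although both are letters, whereas the library test recognizes them.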