11.3. The Preparedness Principle
Well-written program code is prepared for handling any input data, even data that should not occur. Handling may of course consist of simply detecting an error and, for example, skipping erroneous data silently, skipping it with a warning message, or reporting an error and terminating.
In writing a subroutine that will not be called from outside our program, we might consider relying on the caller to pass correct data only, to save both programming and execution time. When writing library routines, especially if they perform complex tasks, the programmer should normally check all input data and expect that, for example, a parameter of string type may contain just anything and of any length.
Processing of character data needs to be efficient, too, if the amount of data is large or processing takes place very often. In most applications, the expected character data is from a small repertoire. When processing data that represents people's answers to questions like "How many...?", we should quickly process an answer that consists of common digits. Whether anything else is treated as an error is a different matter. You might decide to accept other digits too, or even some verbal expressions.
11.3.1. Being Prepared for Amount of Data
In particular, in program code to be invoked by other programs or directly by users in an open environment (e.g., CGI scripts on the Web), checking all data is crucial. The software should expect literally anything, such as a gigabyte of junk sent by a confused or malevolent user. Many attempts at breaking into systems or at making them execute code written by a cracker are based on assumed unpreparedness. Typically, a cracker sends special data that is expected to cause buffer overflow, i.e., to make a program store a string larger than the buffer area allocated for it. The overflow may cause the attacker's data to overwrite the program's code so that next it will be executed.
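As a rough illustration of checking the amount of data before anything else, combined with a fast path for common digits, the following Java sketch parses an answer to a "How many...?" question. The class name CountAnswer, the method parse, and the limit MAX_LENGTH are hypothetical names chosen for this sketch, and rejecting everything except the digits 0-9 is only one of the possible policies mentioned above.

```java
import java.util.OptionalInt;

/** Hypothetical helper for reading an answer to a "How many...?" question. */
final class CountAnswer {
    /** Assumed upper bound on input length; nine digits always fit in an int. */
    static final int MAX_LENGTH = 9;

    /**
     * Returns the parsed count, or an empty result if the input is missing,
     * empty, too long, or contains anything other than the digits 0-9.
     */
    static OptionalInt parse(String input) {
        // Check the amount of data first, before doing any other work on it.
        if (input == null || input.isEmpty() || input.length() > MAX_LENGTH) {
            return OptionalInt.empty();
        }
        int value = 0;
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            // Fast path: common (ASCII) digits only; anything else is rejected.
            if (ch < '0' || ch > '9') {
                return OptionalInt.empty();
            }
            value = value * 10 + (ch - '0');
        }
        return OptionalInt.of(value);
    }

    public static void main(String[] args) {
        System.out.println(parse("42").isPresent());         // accepted
        System.out.println(parse("forty-two").isPresent());  // rejected: not digits
        System.out.println(parse("12345678901").isPresent()); // rejected: too long
    }
}
```

An application that wants to accept other digits or verbal expressions could try this path first and fall back to slower, more general processing only when it returns an empty result.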
In a form on a web page, even if you use an attribute that is expected to limit the amount of data, it can be overruled. Your form might contain <input name="foo" maxlength="80" size="50">, setting the visible width of a text input field to (about) 50 characters and the maximum amount of data to 80 characters. However, anyone could copy the page, edit the form, and modify or remove the restriction, just to do some experimentation or customization or to break your form handler. This could mean sending data where the field is millions of characters long. Thus, the classical advice on handling strings in Henry Spencer's Ten Commandments for C Programmers is particularly important in the modern world:
Of course, you might not use arrays to implement strings in the programming language you use, but the principle is the same: check the lengths of strings.
11.3.2. Being Prepared for Content of Data
In the modern world, we also need to be prepared for any content in strings. Someone someday will type %46\efβ↨♫ or something weirder. There are two basic aspects:
11.3.2.1. Methods of handling unexpected characters
When a program encounters a character (or a code point) that it is not prepared to handle normally, it can perform one or several of the following actions. The choice depends, among other things, on the application, its interactivity, and the type of the character.
11.3.2.2. Displaying unrecognized or undisplayable code points
A program may need to handle unrecognized characters on display. Any software that renders character data should be somehow prepared for the unexpected. Even if you have some planned processing for any defined Unicode character, the data might contain an unassigned code point, a private use code point, or a noncharacter. Unassigned code points might be assigned later, so handling them means being aware of new versions of Unicode.
When an output routine receives a character that it does not understand, it is usually too late to report an error. Errors should be handled at a higher level in the program logic, and the output routine should expect that this has been done.
The Unicode standard mentions, descriptively, the following methods of rendering unassigned code points and private use code points (assuming, of course, that the application does not assign a meaning to such code points):
In practice, programs often use the question mark ?, too. This, as well as displaying the code number as such, is a problem because it cannot always be distinguished from the display of actual data. If possible, use some special formatting (say, a different color) to indicate that something special has happened.
Displaying the code number can be informative to people who know character codes but confusing to others. In any case, it might be a good idea to use some delimiters, such as <E000> or {E000} instead of just E000. If possible, use delimiters that do not normally appear as data characters.
Similar considerations apply to characters that a program recognizes but cannot display, typically due to font restrictions. The standard suggests that the program could display a glyph that reflects the type of the character, as derived from its known properties.
11.3.2.3. Default ignorable code points
The Unicode standard defines some characters as ignorable in display by default, i.e., to be ignored on output if they are not supported in a constructive manner. These characters have no visible glyph or advance width, but when adequately implemented, they may affect the display, positioning, or adornment of adjacent or surrounding characters. The idea is that if a program does not know how to do so, it should not display anything for the character, not even a symbol for a missing character.
Default ignorable code points are described by the Default Ignorable Code Point (DI) property, defined in the DerivedCoreProperties.txt file of the Unicode database. It is a derived property and covers the following:
Default ignorable code points include the soft hyphen U+00AD, the word joiner U+2060, and the left-to-right mark U+200E and the right-to-left mark U+200F (which all have General Category = Cf). Thus, if a program does not support the functionality expressed with some of these characters, it should completely ignore the character on display.
It is permissible for a program to present default ignorable code points in special circumstances, even when it does not implement them as defined. In particular, word processors and layout design programs often have a display mode where invisible formatting characters are shown in some special way.
11.3.3. Table-Driven Versus Property-Driven Processing
In old-style programs that are meant to read ASCII data only, there are only 128 possible input values. In practice, the program actually reads 8 bits, so it should check that the first bit is zero and do something special if it is not. The processing of any normal data, however, can start with a simple branching that uses, for example, a case or switch statement or something similar (depending on language). It is feasible to handle all the possible 128 cases. Alternatively, you could use a table-driven approach that uses a 128-element table to map an input character to something manageable, such as an indicator of its class, according to an application-dependent classification.
In the simplest cases, a program can just test for an input character being "interesting" in the context of the application and skip all other characters. For example, when reading numeric data, a program could recognize just digits and a few other characters like "." and "-" and ignore the rest. However, it is usually much better to report unexpected characters as errors or at least warnings.
When 8-bit character codes are used, similar simple approaches can still be used. A 256-element decision table (or branching construct) is usually not excessively large.
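The table-driven approach for ASCII input might be sketched in Java as follows. The class AsciiClassifier, its class constants, and the particular classification (suited to reading numeric data) are assumptions made for this illustration only; a real application would define its own classes.

```java
/** Hypothetical table-driven classifier for ASCII input. */
final class AsciiClassifier {
    // Application-dependent character classes; the names are illustrative.
    static final byte OTHER = 0, DIGIT = 1, SIGN = 2, POINT = 3, SPACE = 4;

    // One entry per ASCII code; entries not set below stay OTHER (0).
    private static final byte[] TABLE = new byte[128];
    static {
        for (char c = '0'; c <= '9'; c++) {
            TABLE[c] = DIGIT;
        }
        TABLE['-'] = SIGN;
        TABLE['.'] = POINT;
        TABLE[' '] = SPACE;
        TABLE['\t'] = SPACE;
    }

    /** Returns the class of ch, or -1 if ch is outside the ASCII range. */
    static int classify(char ch) {
        return ch < 128 ? TABLE[ch] : -1;
    }

    public static void main(String[] args) {
        String input = "-12.5 x\u00e9";
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            int cls = classify(ch);
            if (cls == -1) {
                // Do something special for non-ASCII input instead of guessing.
                System.out.println("non-ASCII code " + Integer.toHexString(ch));
            } else if (cls == OTHER) {
                // Report unexpected characters rather than skipping them silently.
                System.out.println("unexpected character '" + ch + "'");
            }
        }
    }
}
```

The range check in classify plays the role of testing that the first bit is zero: anything outside the table is reported separately instead of being used as an index.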
When a program reads Unicode data, the situation changes. Even if we consider only BMP characters, there would be tens of thousands of entries to consider. Although modern computers can store and use large tables, the programming work would be excessive. The Unicode properties of characters are, in part, meant to be used to make program logic simpler and programs smaller. You could, for example, first use the General Category property value for the initial branching. You could even group these values by their initial letter: letter (L), mark (M), number (N), separator (Z), punctuation (P), symbol (S), and other (C).
The following rather simple program illustrates several principles described in this chapter. It is meant to work in an environment in which character display is limited to ASCII. It processes a Unicode string and presents it so that ASCII characters are displayed as such whereas other characters are shown using special notations like "[L:f4]," where "L" indicates the character as a letter and "f4" is its code number in hexadecimal. Such presentation might be useful to a knowledgeable person who needs to inspect the content of a Unicode file that mostly consists of ASCII characters.
The program branches according to the General Category (gc) value of the character, as obtained using the getType function; the gc values as defined in the Unicode standard are given in comments:
    public class show {
        /* Prints a bracketed notation: the symbol, then the code number in hex. */
        public static void printc(String symbol, int data) {
            System.out.print("[" + symbol + Integer.toHexString(data) + "]");
        }

        public static void main(String[] args) {
            String msg = "Rhône, 42\u00a0§, price £50";
            for(int i = 0; i < msg.length(); i++) {
                char ch = msg.charAt(i);
                int code = ch;
                if(code <= 0x7F) { /* ASCII */
                    System.out.print(ch);
                } else switch(Character.getType(ch)) {
                    case Character.UPPERCASE_LETTER: /* Lu */
                    case Character.LOWERCASE_LETTER: /* Ll */
                    case Character.TITLECASE_LETTER: /* Lt */
                    case Character.MODIFIER_LETTER: /* Lm */
                    case Character.OTHER_LETTER: /* Lo */
                        printc("L:", code); break;
                    case Character.DECIMAL_DIGIT_NUMBER: /* Nd */
                    case Character.LETTER_NUMBER: /* Nl */
                    case Character.OTHER_NUMBER: /* No */
                        printc("N:", code); break;
                    case Character.NON_SPACING_MARK: /* Mn */
                    case Character.COMBINING_SPACING_MARK: /* Mc */
                    case Character.ENCLOSING_MARK: /* Me */
                        printc("~:", code); break;
                    case Character.SPACE_SEPARATOR: /* Zs */
                        printc(" :", code); break;
                    case Character.LINE_SEPARATOR: /* Zl */
                        System.out.println(); break;
                    case Character.PARAGRAPH_SEPARATOR: /* Zp */
                        System.out.println();
                        System.out.println(); break;
                    case Character.CONTROL: /* Cc */
                    case Character.FORMAT: /* Cf */
                    case Character.SURROGATE: /* Cs */
                    case Character.UNASSIGNED: /* Cn */
                        if(Character.isWhitespace(ch)) {
                            System.out.print(ch);
                        } else if(code >= 0xFFF9 && code <= 0xFFFB) {
                            /* Interlinear annotation characters */
                            printc("A:", code);
                        }
                        /* Otherwise: default ignorable, no display */
                        break;
                    case Character.PRIVATE_USE: /* Co */
                        printc("P:", code); break;
                    case Character.CONNECTOR_PUNCTUATION: /* Pc */
                        printc("_:", code); break;
                    case Character.DASH_PUNCTUATION: /* Pd */
                        printc("-:", code); break;
                    case Character.START_PUNCTUATION: /* Ps */
                        printc("(:", code); break;
                    case Character.END_PUNCTUATION: /* Pe */
                        printc("):", code); break;
                    case Character.INITIAL_QUOTE_PUNCTUATION: /* Pi */
                        System.out.print("[quote]"); break;
                    case Character.FINAL_QUOTE_PUNCTUATION: /* Pf */
                        System.out.print("[unquote]"); break;
                    case Character.OTHER_PUNCTUATION: /* Po */
                        printc("!:", code); break;
                    case Character.MATH_SYMBOL: /* Sm */
                        printc("+:", code); break;
                    case Character.CURRENCY_SYMBOL: /* Sc */
                        printc("$:", code); break;
                    case Character.MODIFIER_SYMBOL: /* Sk */
                        printc("^:", code); break;
                    case Character.OTHER_SYMBOL: /* So */
                        printc("S:", code); break;
                    default:
                        printc("??:", code); break;
                }
            }
            System.out.println();
            System.exit(0);
        }
    }
The program outputs the following, provided that the Unicode data used by the Java runtime classifies the section sign § as So. Under Unicode 6.1 and later, § has gc = Po, so [!:a7] appears in place of [S:a7]:
    Rh[L:f4]ne, 42[ :a0][S:a7], price [$:a3]50
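The program above examines char values one at a time, so a character outside the BMP would reach it as two surrogates. The following variant is a minimal sketch, not a replacement for that program: it iterates over code points and groups the General Category values by their initial letter, as suggested earlier. The class name ShowCodePoints and the methods majorClass and show are hypothetical names, and unlike the program above, this sketch does not suppress default ignorable code points.

```java
/** Hypothetical code-point-based variant that groups gc values by initial letter. */
final class ShowCodePoints {
    /** Maps a General Category constant to the letter of its major class. */
    static char majorClass(int type) {
        switch (type) {
            case Character.UPPERCASE_LETTER: case Character.LOWERCASE_LETTER:
            case Character.TITLECASE_LETTER: case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
                return 'L';
            case Character.NON_SPACING_MARK: case Character.COMBINING_SPACING_MARK:
            case Character.ENCLOSING_MARK:
                return 'M';
            case Character.DECIMAL_DIGIT_NUMBER: case Character.LETTER_NUMBER:
            case Character.OTHER_NUMBER:
                return 'N';
            case Character.SPACE_SEPARATOR: case Character.LINE_SEPARATOR:
            case Character.PARAGRAPH_SEPARATOR:
                return 'Z';
            case Character.CONNECTOR_PUNCTUATION: case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION: case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return 'P';
            case Character.MATH_SYMBOL: case Character.CURRENCY_SYMBOL:
            case Character.MODIFIER_SYMBOL: case Character.OTHER_SYMBOL:
                return 'S';
            default: // Cc, Cf, Cs, Co, Cn
                return 'C';
        }
    }

    /** Returns text with each non-ASCII code point shown as [class:hex]. */
    static String show(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);   // combines a surrogate pair if present
            i += Character.charCount(cp);   // advance by one or two char values
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append('[').append(majorClass(Character.getType(cp)))
                   .append(':').append(Integer.toHexString(cp)).append(']');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1D400 is written as a surrogate pair in the string literal.
        System.out.println(show("Rh\u00f4ne \uD835\uDC00 \uE000"));
    }
}
```

Grouping by major class keeps the branching short at the cost of detail: a currency symbol and a math symbol both appear as S here, whereas the program above distinguishes them.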
11.3.4. Naïve Processing
In old programs, character data is often processed in a naïve manner that assumes a particular character code, typically ASCII. You might even see code like ch == 32, which tests for a character being a space, using the ASCII code, instead of the more natural and more portable ch == ' '.
Suppose that the variable ch contains a single character and we wish to test whether the value is a letter. The following style (exemplified here using the C language notation) is often used in old software:
    if( ((ch >= 'A') && (ch <= 'Z')) || ((ch >= 'a') && (ch <= 'z')) ) ...
Here, && means "and" and || means "or," and the expression operates on comparisons that test whether the character's code number is between the code numbers of "A" and "Z" or between the code numbers of "a" and "z." Generally, in programming languages, comparisons of character values operate on the code numbers of characters.
If the data contains only basic Latin letters, the naïve approach works in most cases. The reason is that in most character codes, those letters are in alphabetic order and consecutive, i.e., there is nothing but letters between "A" and "Z" or between "a" and "z" in the code. However, the assumption is not correct for the EBCDIC code, as described in Chapter 3.
A more serious problem is that the approach fails for letters with diacritic marks, or for other than basic Latin letters in general. It would be awkward to write code that compares a character value against all the possible letters that might appear in Unicode data.
A modern approach, which has been good style for a long time, is to use subprogram (function) calls that test such things. For example, in C, using the standard function library that you may refer to by using #include <ctype.h> in your program, you can write as follows:
    if(isalpha(ch)) ...
This is both simpler and more robust. However, it makes the program depend on the definition of the isalpha function, which can be locale-dependent. This can be a problem or an asset (see the section "Using Locales" later in this chapter).
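The same contrast can be sketched in Java, the language of the example program in the previous section. The class LetterTest and the method isBasicLatinLetter are hypothetical names for this illustration. Note one difference from C: Java's Character.isLetter is defined in terms of the Unicode General Category and does not depend on the locale.

```java
/** Hypothetical comparison of a naive letter test with a property-based one. */
final class LetterTest {
    /** Naive test: recognizes only the basic Latin letters A-Z and a-z. */
    static boolean isBasicLatinLetter(char ch) {
        return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
    }

    public static void main(String[] args) {
        String sample = "Rh\u00f4ne";   // "Rhône"
        for (int i = 0; i < sample.length(); i++) {
            char ch = sample.charAt(i);
            // Character.isLetter consults the General Category (gc = L*),
            // so it also accepts letters with diacritic marks.
            System.out.println(Integer.toHexString(ch)
                + ": naive=" + isBasicLatinLetter(ch)
                + ", property-based=" + Character.isLetter(ch));
        }
    }
}
```

For the character U+00F4 (ô), the naïve test reports false while the property-based test reports true; for the other characters of the sample, the two tests agree.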