8.1. Basics of Character Usage

The use of characters has many aspects, but here we are mainly interested in selecting the most suitable character, when there is a choice between similar-looking characters. The choice may affect the appearance of text, but also the processing of text.

8.1.1. Orthography Sets Rules for Writing

Orthography, or "correct writing," sets rules for using characters. This is largely a matter of writing words correctly, according to rules that some authority has set, or according to established habits and conventions. You might use a dictionary and a spelling checker for this. But there are also rules that relate to grammar rather than dictionaries. For example, English orthography has rules for quotations, and different forms of English have somewhat different rules. In U.S. English, you usually "quote" with double quotation marks, but in British English, you normally 'quote' with single quotation marks.

Although the orthography rules themselves are beyond the scope of this book, there are issues that relate to the identity and coding of characters. For example, the rules of a language might say that a dash is used in some contexts, such as a range notation "0–40." The rules might not identify what "dash" means, and they might even explicitly leave the length of a dash unspecified, to be regarded as a typographic issue. For the purposes of writing text on a computer, you simply have to decide on the identity of a dash. In coded character sets, there is no dash as such. You need to use the em dash, the en dash, or some other specific dash character. Modern orthographic guidelines resolve such issues.
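To make the point about "no dash as such" concrete, the following minimal sketch lists a few characters that a reader might loosely call a dash, together with their Unicode code points and names. The selection of characters and the helper name describe are illustrative assumptions, not an exhaustive or authoritative list.

```python
import unicodedata

# A few characters that might loosely be called "a dash" (illustrative selection).
DASH_LIKE = ["\u002D", "\u2010", "\u2012", "\u2013", "\u2014", "\u2212"]


def describe(ch: str) -> str:
    """Return a 'U+XXXX NAME' label for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in DASH_LIKE:
    print(describe(ch))
```

Each line of output identifies a distinct coded character, which is why a rule that merely says "use a dash" still leaves a choice to be made.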

8.1.2. Typography Is About Appearance

Typography is about typesetting and other tuning of text appearance. It deals with matters such as fonts, spacing, and line length. Typographic rules suggest, for example, that an expression like "0–40" should have some small spacing on both sides of the dash, so that it does not touch the surrounding digits. Usually this does not mean the insertion of any characters. Instead, you might use program-specific tools, such as those mentioned in Chapter 2.

In many writing systems, typography is an essential part of writing, not just optional fine-tuning. In English, we may worry about fonts, word division, etc., or we might just unconsciously accept the default settings of a program. Arabic writing, on the other hand, requires the use of appropriate forms for each character according to its immediate context. In typesetting mathematical texts, typography is often essential for readability and understandability. You may need to combine characters from different fonts, and you need to make sure that the intended meaning is clear in spite of this.

8.1.3. Liberal in What You Accept

An old principle in Internet protocol design is "be conservative in what you send, liberal in what you receive." This was formulated in 1981 as follows by Jon Postel in RFC 791:

The implementation of a protocol must be robust. Each implementation must expect to interoperate with others created by different individuals. While the goal of this specification is to be explicit about the protocol there is the possibility of differing interpretations. In general, an implementation must be conservative in its sending behavior, and liberal in its receiving behavior. That is, it must be careful to send well-formed datagrams, but must accept any datagram that it can interpret (e.g., not object to technical errors where the meaning is still clear).

The principle applies to characters and strings as well as datagrams (certain types of messages), and between programs as well as in Internet communication. The idea is that you should play strictly by the rules but not assume that others always do so.

For example, consider the expression of a temperature in centigrade (degrees Celsius). By international standards, the orthographically correct way is to use a space between the number and the degree sign and to use the character degree sign U+00B0 followed by the letter C, as in "42 °C." Moreover, you should prevent line breaks between the number and the unit, by using a no-break space or by other tools.
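As a minimal sketch of the sending side, the function below builds the notation described above: the number, a no-break space (U+00A0), the degree sign (U+00B0), and the letter C. The function name format_celsius and the choice of the "g" number format are assumptions made for illustration only.

```python
NO_BREAK_SPACE = "\u00A0"  # prevents a line break between number and unit
DEGREE_SIGN = "\u00B0"     # the degree sign, not a look-alike character


def format_celsius(value: float) -> str:
    """Format a temperature as number, no-break space, degree sign, 'C'."""
    return f"{value:g}{NO_BREAK_SPACE}{DEGREE_SIGN}C"


print(format_celsius(42))
```

Note that this sketch leaves the sign of a negative value as the hyphen-minus that Python's formatting produces; choosing a sign character is a separate decision.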

However, when reading or otherwise processing data, you should expect to see different temperature notations, such as "42°C" or "42 C." The variation that you can and should deal with depends on the circumstances. For example, if you detect that "42 ºC" actually contains a masculine ordinal indicator º and not the degree sign °, it is practically certain that you still know what was meant. If you design a program that detects such a situation, it should probably process the data under the assumption that the degree sign was meant, without even issuing a message about this, although sometimes it might be suitable to issue a mild warning. On the other hand, "42 C" is a more difficult case, since it could conceivably be the correct notation for 42 coulombs, for example.
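The receiving side might be sketched as follows. This is one possible reading of the advice above, not a definitive implementation: the function name parse_celsius, the set of accepted look-alike characters, and the decision to reject "42 C" outright are all assumptions.

```python
import re
from typing import Optional

# Characters accepted in the role of a degree sign (assumed set):
# U+00B0 DEGREE SIGN and U+00BA MASCULINE ORDINAL INDICATOR.
DEGREE_LIKE = "\u00B0\u00BA"

# Optional whitespace (including no-break space, which \s matches for str
# patterns) is allowed around the degree-like character.
_CELSIUS = re.compile(
    r"^\s*([+-]?\d+(?:\.\d+)?)\s*[" + DEGREE_LIKE + r"]\s*C\s*$"
)


def parse_celsius(text: str) -> Optional[float]:
    """Return the temperature in degrees Celsius, or None if unrecognized.

    A bare "42 C" is deliberately not accepted, since it might mean coulombs.
    """
    match = _CELSIUS.match(text)
    if match is None:
        return None
    return float(match.group(1))


for sample in ["42 \u00B0C", "42\u00B0C", "42 \u00BAC", "42 C"]:
    print(repr(sample), parse_celsius(sample))
```

Whether an unrecognized notation should be rejected silently, reported, or passed on for manual review depends on the circumstances, as the text notes.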

Similarly, if a program reads a Unicode text file and interprets its content as numeric data, it should recognize, for example, "-42" (with hyphen-minus), "–42" (with en dash), and "−42" (with minus sign) all as indicating a negative number. That is, you should not be picky about the use of the minus sign but accept characters that are widely used in the role of a minus sign. Note that common library routines for reading numeric data, like scanf in C, generally treat only the hyphen-minus character as a minus sign; i.e., they reflect the old and widespread usage and do not accept the "real" minus sign even as an alternative.
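A liberal reader for integers could be sketched like this. The name parse_signed_int and the exact set of accepted characters (hyphen-minus U+002D, en dash U+2013, minus sign U+2212) are assumptions based on the three notations mentioned above; a real application might accept more or fewer.

```python
# Map characters widely used in the role of a minus sign to hyphen-minus.
_TO_HYPHEN_MINUS = str.maketrans({"\u2013": "-", "\u2212": "-"})


def parse_signed_int(text: str) -> int:
    """Parse an integer, accepting several minus-like sign characters.

    Raises ValueError for input that is not an integer after normalization.
    """
    normalized = text.strip().translate(_TO_HYPHEN_MINUS)
    return int(normalized)


for sample in ["-42", "\u201342", "\u221242", "42"]:
    print(repr(sample), parse_signed_int(sample))
```

The sketch normalizes first and then relies on the ordinary conversion routine, so that malformed input such as a doubled sign is still rejected.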

8.1.4. Conservative in What You Send

The note in the previous section illustrates the difficulty of being conservative with characters. If you prepare data for an application that is fully equipped to process Unicode data, the conservative way is to use the Unicode minus sign to denote a negative number. It is the most appropriate character for the purpose in that context. On the other hand, if you prepare data for an unknown application or a multitude of applications, it is probably much better to use a hyphen-minus character "-" as a replacement for the minus sign.
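The choice described above can be sketched as a formatting function that takes the capabilities of the recipient into account. The function name format_int and the unicode_capable flag are hypothetical; in practice, knowing what the recipient can handle is the hard part.

```python
MINUS_SIGN = "\u2212"   # the Unicode minus sign
HYPHEN_MINUS = "-"      # the ASCII fallback


def format_int(value: int, unicode_capable: bool) -> str:
    """Format an integer, choosing the sign character for the recipient.

    unicode_capable is an assumption supplied by the caller: True only when
    the recipient is known to process the Unicode minus sign correctly.
    """
    if value < 0:
        sign = MINUS_SIGN if unicode_capable else HYPHEN_MINUS
        return f"{sign}{abs(value)}"
    return str(value)


print(format_int(-42, unicode_capable=True))
print(format_int(-42, unicode_capable=False))
```

Defaulting the flag to False in calling code would correspond to the cautious choice recommended for unknown recipients.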

Even if the immediate target application is Unicode-capable, your data might be transferred from it to something much more limited. For example, a multilingual database could (and normally should) internally use Unicode, but it might be accessed using connections, software, and devices that seriously limit the output and input possibilities. Ideally, the database should contain all text data in the most appropriate Unicode format, and various restrictions on character repertoire should be taken into account when data is sent from it or received by it. As practical principles in being conservative in this sense, we can recommend:

In email messages, use ASCII (Basic Latin) only, by default, unless working with a community that can be expected to be able to deal with other encodings.
In communication within a language community that generally uses a particular character repertoire, use it. For example, in French, German, or Spanish, use Latin-1 Supplement in addition to ASCII (i.e., use ISO Latin 1).
When a wider character repertoire is indispensable, try to limit the use of characters to a subset of Unicode that is known to work widely. For example, in European multilingual contexts and in simple mathematical and technical texts that need special symbols, try to restrict the repertoire to the Minimum European Subset 2 (MES-2). As a more practical criterion, use characters in the Windows Glyph List 4 (WGL 4), which is what the most common fonts cover, more or less.
For text-processing and publishing purposes, try to identify and document in advance the set of characters you will need, and test how the relevant software can handle it. This will help you in identifying the fonts that can be used.
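One simple way to test text against a restricted repertoire, in the spirit of the recommendations above, is to check which characters cannot be represented in a given encoding. The sketch below uses Python's "ascii" and "latin-1" codecs as stand-ins for the ASCII and ISO Latin 1 repertoires; the function name outside_repertoire is an assumption, and subsets such as MES-2 or WGL 4 would need an explicit character list instead of a codec.

```python
def outside_repertoire(text: str, encoding: str) -> list:
    """Return the distinct characters of text not encodable in encoding.

    Characters are reported in order of first appearance.
    """
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


sample = "0\u201340 \u00B0C, 5 \u20AC"
print(outside_repertoire(sample, "ascii"))
print(outside_repertoire(sample, "latin-1"))
```

Running such a check before sending data shows which characters would need a replacement or a wider encoding for the intended recipients.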