7.2. Character Requirements of Languages

Although Unicode contains almost all characters used in languages in current use, it is still, and will always be, relevant to consider the character requirements that different languages impose. Here we will first list some of the reasons for this, then analyze the concept of "character requirements," and finally study some specific languages.

7.2.1. The Impact of Character Repertoire

As mentioned in the section "Definitions of Character Repertoires" in Chapter 1, there are good reasons to try to estimate the repertoire of characters that will appear in a document or in an application. In more detail, the reasons include the following:
At the technical level, there is also the consideration that if you restrict yourself to a small repertoire of characters, you have more options when choosing the encoding. For example, if you use just the characters normally used in English, you can use almost any encoding, including ASCII, the ISO 8859 encodings, etc. If you decide that the copyright symbol © is needed, too, then you exclude both ASCII and several of the ISO 8859 encodings (unless you can use "escape notations" like &copy; in HTML).

Such technical limitations are slowly losing their importance, but other limitations persist. For example, when designing methods for user input, we should focus on characters that will be used frequently, support some less common characters in a reasonably easy way, and leave the rest to some generic method, even if that method is not very convenient. In a sense, it is good to make the entry of rarely used characters difficult, since they will then appear by mistake less often.

7.2.2. Languages and Characters

Languages vary greatly in their requirements on the repertoire of characters. English can be written using fewer than a hundred different characters, whereas Chinese needs thousands of characters, or tens of thousands, if you count the rare characters too.

Moreover, the needs are difficult to analyze. Is é needed in English, because it appears in words like "fiancé"? Normal English text, even in a newspaper, may contain special characters like µ on science pages, ♠ in a bridge column, and ® in an advertisement.

7.2.2.1. What constitutes a character?

Language affects the way people look at characters and what they identify as a single character. This primarily applies to a person's native language. If English is your native language, you may well classify the œ in the French word "œuvre" as just a way of writing "o" and "e" together. After all, in English, expressions like "hors d'oeuvre" are commonly written with separate "o" and "e."
If French is your native language, you might treat œ as a single letter and "oe" just as a replacement that is used out of necessity. Perhaps an even better example is æ, which is certainly a separate letter to people who speak Danish and Norwegian but just a typographic variant of "ae" to many English-speaking people, who either never noticed æ or saw it only in contexts where it is apparently just a way of writing "ae" (e.g., in "Cæsar" for "Caesar").

The way we identify characters affects how we count characters. How many letters are there in the string "Cæsar"? This makes instructions, limitations, and operations on character count relative. If you are prompted for some information, to be written in less than 42 characters, how do you know how some program counts characters? When exactness is important, as it might be in contracts, it might be suitable to define explicitly that characters are counted by the number of Unicode characters when the text is in Unicode Normalization Form C. Unfortunately, few people understand what that means, but the same applies to many other exact definitions.

7.2.2.2. Does Unicode support all languages?

Short descriptions of Unicode often present it as more universal than it really is. They might, in particular, claim that Unicode supports all languages, or at least all living languages, or that it contains all characters used by humanity.

The question "Does Unicode support all languages?" is vague on several counts. To begin with, does "Unicode" refer to the collection of Unicode characters, or to the Unicode standard, or to the Unicode Consortium? What does "support" mean? And what do you mean by "language," and specifically by "all languages"? Thus, any reasonable answer needs to clarify the question. Here is an attempt at a short answer:

Almost all living languages, and many dead languages, can be written in their normal writing system(s) using Unicode characters.
However, this might not quite mean what you intuitively expect it to mean. Note, in particular, the following points:
These points reflect the design of Unicode, not failures or incompleteness in achieving its goals. On the other hand, there are also some characters used in living languages that have not yet been included in Unicode. Those languages are used by very small communities, and your odds of ever seeing them written are rather small, unless you are an ethnologist or linguist. For example, Unicode 4.1 lacks some Cyrillic characters that are used by some ethnic groups (Enets, Chukchi, etc.) in Russia.

The first point means that when writing some languages, we cannot use a single Unicode character (code point) to denote what people intuitively understand as one character in that language. For example, a language may have the letter "i" with macron and grave accent, but in Unicode, it can only be written using two or three characters. In Chapter 4, we described some concepts and techniques meant to help with this. Yet, people may think that Unicode puts such languages in a different position than others.

As a thought experiment, let us suppose that the letter "w" had not been included in ASCII or other character codes but had been written as "vv," and that Unicode had not changed this. If people then asked for the letter "w" to be included in Unicode, the answer would be that it is just a typographic variant of "vv" written as a ligature (as it historically is, in fact). Maybe after much debate, we would then be told to use the combination of three Unicode characters, "v," word joiner, and "v." Maybe we could officially register this as a character sequence. Yet, could we then really say that Unicode supports the English alphabet, for example?

The discussion above deals with "living languages," which is itself a somewhat vague concept.
There are extinct languages that are not used as anyone's native language, or otherwise in normal speech or writing, but might still be used quite a lot in scholarly documents, or perhaps used by hobbyists who wish to revive a language. Constructed (artificial) languages have usually been created for use as people's second (or maybe even first) language, but the great majority of them have no actual use, or no use outside a very small circle. Esperanto is the best-known exception; it is well covered by Unicode.

Finally, there are languages that might be classified as fictional, such as the Klingon language (from the Star Trek TV series) and the languages of Middle-earth (from the books of J.R.R. Tolkien). Such languages may lack full description, actual usage by human beings, and an established writing system. If fictional languages cannot be written in Unicode, the reason may well be that they are not written at all, but it is also possible that they can be classified as written languages, perhaps with some characters that wait for inclusion in Unicode.

7.2.2.3. Attempts at technical definitions of character requirements

In 1995, an Internet draft titled "Characters and character sets for various languages" was composed by Harald Alvestrand. Although it expired soon and was in many ways incomplete, it was long used for checking character requirements. After all, if you are asked to design software that can handle characters in some languages that you don't know, you have to start somewhere. The draft is still available at the address http://www.eki.ee/itstandard/docs/draft-alvestrand-lang-char-03.txt. For some languages, it listed "important characters" in addition to "required characters."

There was an attempt at creating a "cultural registry" that describes character requirements along with some other information about languages. The structure was described in the ISO 15897 standard (approved in 1999).
The registry was not populated with much data, except for some Nordic languages, and the information in it was not used much. The registry technically still exists, at http://anubis.dkuug.dk/cultreg/, but it has not been updated for years. Probably the main reason for the failure was lack of interest and participation by major software vendors, i.e., the organizations on which the wide use of such information mainly depends.

The Common Locale Data Repository (CLDR), described in Chapter 11, contains two data fields for describing a language's character requirements with regard to letters:
The description of the CLDR database makes it clear that the basic exemplarCharacters set should be rather narrow:
The content of the exemplarCharacters fields in the CLDR is available, formatted as a table, at http://www.unicode.org/cldr/data/diff/by_type/characters.html. The structure of the CLDR is being developed, and the descriptions of character requirements will probably evolve quite a lot. On the other hand, even at the present stage, the CLDR constitutes the best available overall description of such matters. It should however be used with caution due to the following problems:
7.2.2.4. Which characters does a language need?

Questions like "Which characters does language X need?" are both very important and very difficult. It isn't even a well-defined question before you spend quite some time on it. Yet, it affects, or should affect, keyboard design and settings, font choices, input checks, text scanning, etc. Even though Unicode lets you use any characters, roughly speaking, it is still relevant to know which characters will actually be used, or needed.

People may disagree on what really belongs to a language, even at the character level. Orthographic rules on punctuation have often been defined so that it is debatable what Unicode characters are meant. For example, the rules may discuss "dash" without telling whether it is an em dash or an en dash or whether either of them could be used. There can also be dispute on whether a character difference should be made between some letters that look very similar to each other.

Instead of trying to find a one-dimensional answer, we can specify classes of characters needed in a language in a layered manner. Some characters are essential, some are auxiliary, and some are rare visitors. In a closer analysis, we might consider the following classes:
7.2.3. Language Coverage of ISO Latin Alphabets

The ISO Latin alphabets are defined by ISO 8859 standards as listed in Table 7-2. There are other ISO 8859 standards, but they define character sets that contain the ASCII characters and some collections of non-Latin letters (see Chapter 3). Note that ISO Latin 5, 6, 7, 8, 9, and 10 correspond to ISO 8859-9, -10, -13, -14, -15, and -16, respectively.

The ISO Latin alphabets were primarily designed to meet the needs of some languages used as official or regional languages in Europe and written in Latin letters. Table 7-2 summarizes the suitability of ISO Latin alphabets for them. The information is mainly derived from the ISO 8859 standards. For example, the table says that Croatian can be written in ISO Latin 2 or in ISO Latin 10 (i.e., ISO-8859-16).

As a side effect, all or some of the ISO Latin alphabets cover other languages as well, such as Afrikaans, Indonesian/Malay, Swahili, and Tagalog. This issue will not be explored here.

Support for a language in some repertoire of characters is often subject to interpretation and even debate. In particular, the descriptions in the ISO 8859 standards deal with the availability of letters, not punctuation marks. Moreover, the considerations are limited to modern forms of the languages and to use "for general purpose applications in typical office environments," as ISO 8859 standards put it. To point out some other problems, some entries are marked with an asterisk *, with explanations after the table.
Explanations for Table 7-2:
We will next consider in more detail the character requirements of two languages, Spanish and French. Compared with the variation among the world's languages, they are rather similar in their writing systems and use of characters. Yet, problems emerge in the details.

7.2.4. Example: Spanish

The basic character requirements of Spanish include (in addition to ASCII characters):
Except for the dash and the curly quotation marks, Spanish is covered by the ISO Latin 1 character repertoire, and the Windows Latin 1 repertoire adds the missing characters, as well as the euro sign, €. For the purposes of writing Spanish, ISO Latin 9 (ISO 8859-15) is the same as ISO Latin 1 with the addition of the euro sign, but ISO Latin 9 is little used. Some other ISO Latin alphabets could be used for Spanish, too, but most of them lack the inverted exclamation mark and the inverted question mark. Spanish also uses ellipsis points, "puntos suspensivos," but they are usually unspaced, unlike in recommended English practice. Therefore, they can be represented as sequences of three periods (U+002E U+002E U+002E) rather than as the horizontal ellipsis character (U+2026). MS Word helps in writing Spanish, if it has recognized the language from the text or you tell it via Word commands. In Spanish mode, MS Word does not convert three periods to English-style ellipsis as it would otherwise do. It also changes an ! or ? at the start of a sentence to an inverted exclamation or question mark, and it changes, for example, "2a" to "2ª." Somewhat strangely, MS Word produces English-style quotation marks ("bien") in Spanish mode, even though Spanish literary usage favors guillemets («bien»). In Spanish, the acute accent indicates the vowel as stressed, and this may imply a difference in meanings of words. However, in names, the accent rarely has a distinctive meaning. Accented letters are not counted as separate letters in the alphabet, and the accent is taken into account in alphabetic ordering at the secondary level only (i.e., for words that are otherwise the same). By the official rules, accents are used in uppercase letters, too, although it is not rare to deviate from this. 
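The secondary-level treatment of accents described above can be sketched as a two-part sort key. This is a minimal Python illustration under stated assumptions, not an implementation of any official collation: the names ALPHABET, strip_accents, and spanish_key are hypothetical, ñ is assumed to be a separate letter placed after n, and only the acute accent and the diaeresis are ignored at the primary level. Real applications would normally rely on a collation library instead.

```python
import unicodedata

# Assumption: modern Spanish alphabet order, with ñ as a separate letter after n.
ALPHABET = "abcdefghijklmnñopqrstuvwxyz"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}

# Assumption: marks ignored at the primary level are the acute accent (U+0301)
# and the diaeresis (U+0308). The tilde of ñ (U+0303) is deliberately kept.
IGNORED_MARKS = {"\u0301", "\u0308"}


def strip_accents(word):
    """Lowercase the word and drop acute accents and diaereses, keeping ñ."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    kept = "".join(ch for ch in decomposed if ch not in IGNORED_MARKS)
    return unicodedata.normalize("NFC", kept)


def spanish_key(word):
    """Sort key: base letters first, accented spelling only as a tie-breaker."""
    primary = [RANK.get(ch, len(ALPHABET)) for ch in strip_accents(word)]
    secondary = unicodedata.normalize("NFC", word.lower())
    return (primary, secondary)


words = ["nube", "m\u00e1s", "\u00f1u", "mas", "nota"]
ordered = sorted(words, key=spanish_key)
```

In this sketch, "mas" and "más" receive the same primary key and are separated only by the secondary component, while "ñu" sorts after every word that begins with "n."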
Traditionally, the combinations (digraphs) "ch" and "ll" (which denote specific phonemes in Spanish) have been regarded as separate letters of the alphabet, positioned between "c" and "d," and between "l" and "m," respectively. However, in 1994, the association of academies of the Spanish language decided to accept the treatment of these combinations as pairs of letters in alphabetic ordering. Previously Spanish had, for example, "correo" < "chico," since "c" < "ch," but now the official sorting rules follow the international pattern.

The details of official Spanish orthography can be found in the document "Ortografía de la lengua española," available online via http://www.rae.es/.

7.2.5. Example: French

The basic character requirements of French include (in addition to ASCII characters):
Except for the character œ, the dash, and the curly quotation marks, French is covered by the ISO Latin 1 character repertoire, and the Windows Latin 1 repertoire adds the missing characters, as well as the euro sign, €.

However, there is an essential feature in French orthography that cannot be properly addressed in plain text even using Unicode. The orthography rules require a thin space (espace fine) after or before some punctuation marks, e.g., before an exclamation mark. Naturally, such a space should be nonbreaking. This problem is discussed in the section "General Punctuation" in Chapter 8.

The letter œ, the "oe" ligature, has often been written as the character pair "oe" due to character code limitations. The letter œ was one of the reasons for defining the ISO-8859-15 code (ISO Latin 9), which has not gained much popularity, since you can use œ in windows-1252, and naturally in any Unicode encoding. A normal French keyboard still has no key for œ, so some special technique is needed to type it.

There is no simple way to type æ on a French keyboard either. In practice, "ae" is very often used instead, although the dictionary of the French Academy uses æ spellings.

MS Word helps in writing French, for example, by turning, in French mode, the ASCII quotation mark into French-style quotation marks (e.g., turning the input "bien" into « bien »). Like Spanish, French uses unspaced periods for ellipsis.

Diacritic marks are essential in French, and should be used on uppercase letters, too, according to the recommendation of the French Academy. However, there are differences of opinion and expectations in this area. Therefore, programs often contain a user-settable option for allowing or disallowing accents on capital letters in French text. When they are disallowed, conversion of "égalité" to uppercase would produce "EGALITE." In MS Word, the setting is Tools > Options > Edit > Allow accented uppercase in French.
The default setting for this may depend on the version of French (e.g., so that it is normally off for the French of France, but on for Canadian French).

On a typical French keyboard ("azerty keyboard"), the methods for typing letters with a diacritic mark are different for different characters. In particular, there is no obvious way to enter the capital letters É and Ç, so the user needs to know and to use some special technique (such as Alt-0201 and Alt-0199, or Ctrl-' E in Word). The letter œ (or Œ) cannot be typed in any obvious way either.

Since French uses several diacritic marks, it's easier to get them wrong than in Spanish. For example, "e" with grave, è, and "e" with acute, é, are often confused with each other by foreigners, even though their main purpose is to indicate a difference in pronunciation. When spellchecking is enabled and French is supported in it, such confusion will almost always be detected.

There was a large reform of the use of diacritic marks in French in the 1990s. Generally, their use was reduced. Old texts and even old programs (e.g., spellcheckers) might still reflect the old rules. The new rules are described in the document "Rectification de l'orthographe," http://www.academiefrancaise.fr/langue/orthographe/plan.html.
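The repertoire claims made above for Spanish and French (which characters fit in ISO Latin 1, ISO Latin 9, and windows-1252) can be checked mechanically. The following is a minimal Python sketch, not part of the original discussion: the function name usable_encodings and the default list of candidates are illustrative assumptions, and the codec names are those used by Python's standard library.

```python
def usable_encodings(text, candidates=("ascii", "latin-1", "iso8859-15", "cp1252", "utf-8")):
    """Return the candidate encodings that can represent every character in text."""
    usable = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            # At least one character is outside this encoding's repertoire.
            continue
        usable.append(name)
    return usable


spanish = usable_encodings("\u00bfQu\u00e9 tal?")   # inverted question mark, e with acute
french = usable_encodings("\u0153uvre")             # oe ligature
with_dash = usable_encodings("s\u00ed \u2014 no")   # em dash between the words
```

With these candidates, a word containing œ is rejected by ASCII and ISO Latin 1 but accepted by ISO Latin 9 and windows-1252, in line with the discussion of œ above, and text containing an em dash is left with windows-1252 and UTF-8 only.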