7.2. Character Requirements of Languages

Although Unicode contains almost all characters of the languages in current use, it is still, and will always be, relevant to consider the character requirements that different languages impose. Here we will first list some of the reasons for this, then analyze the concept of "character requirements," and finally study some specific languages.

7.2.1. The Impact of Character Repertoire

As mentioned in the section "Definitions of Character Repertoires" in Chapter 1, there are good reasons to try to estimate the repertoire of characters that will appear in a document or in an application. In more detail, the reasons include the following:

  • A font typically supports a limited character repertoire only. Full Unicode fonts are rare, and usually not suitable for copy text.

  • In particular, artistic or otherwise special fonts, such as those used for headings and buttons, often have a very limited character repertoire.

  • A program that will be used for processing your document in some way might be prepared to handle a limited repertoire only.

  • Special characters in normal text often result from mistyping or other errors. When checking input data, it is often useful to detect any "unusual" characters and issue warnings about them.

  • In particular, character recognition (in scanning text or in processing handwritten characters) works best if the assumed repertoire is small. It can be very difficult to distinguish between similar-looking characters like ă and ǎ ("a" with breve and "a" with caron). Things are much easier if you can expect only one of them to occur.

At the technical level, there is also the consideration that if you restrict yourself to a small repertoire of characters, you have more options when choosing the encoding. For example, if you use just the characters normally used in English, you can use almost any encoding, including ASCII, ISO 8859 encodings, etc. If you decide that the copyright symbol © is needed, too, then you exclude both ASCII and several of the ISO 8859 encodings (unless you can use "escape notations" like &copy; in HTML).
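The effect of a single extra character on the choice of encoding can be checked mechanically. The following Python sketch is only an illustration: the function name usable_encodings and the short list of candidate encodings are assumptions made for this example. It asks Python's built-in codecs whether each of them can represent a given text.

```python
# Candidate encodings, by their Python codec names (an arbitrary sample
# chosen for illustration, not a complete list).
CANDIDATES = ["ascii", "iso8859_1", "iso8859_2", "iso8859_15", "cp1252", "utf_8"]


def usable_encodings(text, candidates=CANDIDATES):
    """Return the candidate encodings that can represent every character of text."""
    usable = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # at least one character has no code in this encoding
        usable.append(name)
    return usable


print(usable_encodings("Copyright 2006"))
print(usable_encodings("Copyright \u00a9 2006"))  # same text with the copyright sign
```

With plain English text, all the candidates qualify. Adding the copyright sign removes ASCII and ISO 8859-2 from the list, while ISO 8859-1, ISO 8859-15, windows-1252, and UTF-8 remain.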

Such technical limitations are slowly losing their importance, but other limitations persist. For example, when designing methods for user input, we should focus on characters that will be used frequently, support some less common characters in a reasonably easy way, and leave the rest to some generic method, even if that method is not very convenient. In a sense, it is good to make the entry of rarely used characters difficult; that way they will not appear by mistake so often.

7.2.2. Languages and Characters

Languages have widely varying requirements on the repertoire of characters. English can be written using fewer than a hundred different characters, whereas Chinese needs thousands of characters, or tens of thousands, if you count the rare characters too. Moreover, the needs are difficult to analyze. Is é needed in English, because it appears in words like "fiancé"? Normal English text, even in a newspaper, may contain special characters like µ on science pages, ♠ in a bridge column, and ® in an advertisement.

7.2.2.1. What constitutes a character?

Language affects the way people look at characters and what they identify as a single character. This primarily applies to a person's native language. If English is your native language, you may well classify the œ in the French word "œuvre" as just a way of writing "o" and "e" together. After all, in English, expressions like "hors d'oeuvre" are commonly written with separate "o" and "e." If French is your native language, you might treat œ as a single letter and "oe" just as a replacement that is used out of necessity. Perhaps an even better example is æ, which is certainly a separate letter to people who speak Danish and Norwegian but just a typographic variant of "ae" to many English-speaking people, who either never noticed æ or saw it only in contexts where it is apparently just a way of writing "ae" (e.g., in "Cæsar" for "Caesar").

The way we identify characters affects how we count characters. How many letters are there in the string "Cæsar"? This makes instructions, limitations, and operations on character count relative. If you are prompted for some information, to be written in less than 42 characters, how do you know how some program counts characters? When exactness is important, as it might be in contracts, it might be suitable to define explicitly that characters are counted by the number of Unicode characters when the text is in Unicode Normalization Form C. Unfortunately, few people understand what that means, but the same applies to many other exact definitions.
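As a minimal sketch of the counting rule suggested above, the following Python fragment counts characters as code points after conversion to Normalization Form C. The function name count_characters is an assumption made for this illustration; the normalization itself is done by the standard unicodedata module.

```python
import unicodedata


def count_characters(text):
    """Count Unicode characters (code points) after normalizing text to NFC."""
    return len(unicodedata.normalize("NFC", text))


composed = "fianc\u00e9"      # é as one precomposed character
decomposed = "fiance\u0301"   # "e" followed by a combining acute accent

# The raw code point counts differ, although the two strings look the same.
print(len(composed), len(decomposed))
# After normalization to NFC, both strings give the same count.
print(count_characters(composed), count_characters(decomposed))
# The letter æ is a single Unicode character, so "Cæsar" counts as five.
print(count_characters("C\u00e6sar"))
```

Under this rule, the answer no longer depends on whether an accented letter happened to be entered in precomposed or decomposed form.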

7.2.2.2. Does Unicode support all languages?

Short descriptions of Unicode often present it as more universal than it really is. They might, in particular, claim that Unicode supports all languages, or at least all living languages, or that it contains all characters used by humanity.

The question "Does Unicode support all languages?" is vague on several counts. To begin with, does "Unicode" refer to the collection of Unicode characters, or to the Unicode standard, or to the Unicode Consortium? What does "support" mean? And what do you mean by "language," and specifically by "all languages"?

Thus, any reasonable answer needs to clarify the question. Here is an attempt at a short answer: Almost all living languages, and many dead languages, can be written in their normal writing system(s) using Unicode characters. However, this might not quite mean what you intuitively expect it to mean. Note, in particular, the following points:

  • Some languages use characters that cannot be represented as single Unicode characters but need to be written as combinations (sequences of Unicode characters). For example, some accented characters cannot be written as a single character but as base characters followed by some combining diacritic mark(s). In this sense, the claim that Unicode "provides a unique number for every character" (as the Consortium's page "What is Unicode?" says) is somewhat misleading.

  • Some orthographic and typographic differences that could be expressed in plain text cannot be expressed in Unicode. This results from the unification policy, which often treats, for example, the differences between Chinese and Japanese characters as typographic.

  • Some of the properties of characters as defined by the Unicode standard do not correspond to their behavior in different languages. For example, Unicode line-breaking rules previously permitted a line break after a colon :, but some languages use it inside words, and a line break after a colon can seriously violate the rules of the language.

  • Unicode is meant to describe plain text only, so it generally lacks any support that might be needed for display and processing of text by language-specific rules.

These points reflect the design of Unicode, not failures or incompleteness in achieving its goals. On the other hand, there are also some characters used in living languages that have not yet been included in Unicode. Those languages are used by very small communities, and your odds of ever seeing them written are rather small, unless you are an ethnologist or linguist. For example, Unicode 4.1 lacks some Cyrillic characters that are used by some ethnic groups (Enets, Chukchi, etc.) in Russia.

The first point means that when writing some languages, we cannot use a single Unicode character (code point) to denote what people intuitively understand as one character in that language. For example, a language may have the letter "i" with macron and grave accent, but in Unicode, it can only be written using two or three characters. In Chapter 4, we described some concepts and techniques meant to help with this. Yet, people may think that Unicode puts such languages in a different position than others.
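The "two or three characters" can be observed directly with normalization. The following Python sketch uses hypothetical helper names (code_point_counts, has_single_character_form) and compares the letter "e" with acute accent, which has a precomposed form, with the letter "i" with macron and grave accent, which has none.

```python
import unicodedata


def code_point_counts(sequence):
    """Return the lengths of the NFC and NFD forms of sequence, in code points."""
    nfc = unicodedata.normalize("NFC", sequence)
    nfd = unicodedata.normalize("NFD", sequence)
    return len(nfc), len(nfd)


def has_single_character_form(sequence):
    """True if NFC composes the sequence into exactly one Unicode character."""
    return code_point_counts(sequence)[0] == 1


e_acute = "e\u0301"                # e + combining acute accent
i_macron_grave = "i\u0304\u0300"   # i + combining macron + combining grave accent

for label, seq in [("e with acute", e_acute),
                   ("i with macron and grave", i_macron_grave)]:
    print(label, has_single_character_form(seq), code_point_counts(seq))
```

For the second letter, the composed form still needs two code points ("i" with macron followed by a combining grave accent), and the fully decomposed form needs three.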

As a thought experiment, let us suppose that the letter "w" had not been included in ASCII or other character codes but written as "vv," and that Unicode had not changed this. If people then asked for the letter "w" to be included in Unicode, the answer would be that it is just a typographic variant of "vv" written as a ligature (as it historically is, in fact). Maybe after much debate, we would then be told to use the combination of three Unicode characters, "v," word joiner, and "v." Maybe we could officially register this as a character sequence. Yet, could we then really say that Unicode supports the English alphabet, for example?

The discussion above deals with "living languages," a subject that is itself a somewhat vague concept. There are extinct languages that are not used as anyone's native language, or otherwise in normal speech or writing, but might still be used quite a lot in scholarly documents, or perhaps used by hobbyists who wish to revive a language. Constructed (artificial) languages have usually been created for use as people's second (or maybe even first) language, but the great majority of them have no actual use, or no use outside a very small circle. Esperanto is the best-known exception; it is well covered by Unicode. Finally, there are languages that might be classified as fictional, such as the Klingon language (from the Star Trek TV series) and the languages of Middle Earth (from the books of J.R.R. Tolkien). Such languages may lack full description, actual usage by human beings, and an established writing system. If fictional languages cannot be written in Unicode, the reason may well be that they are not written at all, but it is also possible that they can be classified as written languages, perhaps with some characters that wait for inclusion into Unicode.

7.2.2.3. Attempts at technical definitions of character requirements

In 1995, an Internet draft titled "Characters and character sets for various languages" was composed by Harald Alvestrand. Although it expired soon and was in many ways incomplete, it was long used for checking character requirements. After all, if you are asked to design software that can handle characters in some languages that you don't know, you have to start somewhere. The draft is still available at the address http://www.eki.ee/itstandard/docs/draft-alvestrand-lang-char-03.txt. For some languages, it listed "important characters" in addition to "required characters."

There was an attempt at creating a "cultural registry" that describes character requirements along with some other information about languages. The structure was described in the ISO 15897 standard (approved in 1999). The registry was not populated with much data, except for some Nordic languages, and the information in it was not used much. The registry technically still exists, at http://anubis.dkuug.dk/cultreg/, but it has not been updated for years. Probably the main reason for the failure was lack of interest and participation by major software vendors, i.e., the organizations on which the wide use of such information mainly depends.

The Common Locale Data Repository (CLDR), described in Chapter 11, contains two data fields for describing a language's character requirements with regard to letters:


Basic characters (exemplarCharacters)

Letters needed for normal writing of the language. For English, this consists of the letters a to z only. (Uppercase forms are implicitly included.)


Auxiliary characters (exemplarCharacters with type="auxiliary")

Additional letters that may appear in texts in the language, typically in (relatively) common foreign words. For English, this currently consists of the following set: áà éè íì óò úù âêîôû æœ äëïöüÿ åø çñß. As you can see, it is a rather mixed collection and contains several characters outside the Windows Latin 1 repertoire.

The description of the CLDR database makes it clear that the basic exemplarCharacters set should be rather narrow:

In general, the test to see whether or not a letter belongs in the set is based on whether it is acceptable in that language to always use spellings that avoid that character. For example, the exemplar character set for en (English) is the set [a-z]. This set does not contain the accented letters that are sometimes seen in words like "résumé" or "naïve", because it is acceptable in common practice to spell those words without the accents.

The content of the exemplarCharacters fields in the CLDR is available, formatted as a table, at http://www.unicode.org/cldr/data/diff/by_type/characters.html.
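To show how such exemplar sets might be applied, for example in input checking, here is a small Python sketch. It hard-codes the two English sets as quoted in this section instead of reading them from the CLDR, and the function name classify_letters is an assumption made for this illustration.

```python
import unicodedata

# Exemplar sets for English as quoted in this section. They are copied by
# hand for the example; real code would read them from the CLDR data.
BASIC = set("abcdefghijklmnopqrstuvwxyz")
AUXILIARY = set(unicodedata.normalize("NFC", "áàéèíìóòúùâêîôûæœäëïöüÿåøçñß"))


def classify_letters(text):
    """Group the distinct letters of text into basic, auxiliary, and other."""
    groups = {"basic": set(), "auxiliary": set(), "other": set()}
    for ch in unicodedata.normalize("NFC", text):
        if not ch.isalpha():
            continue  # the exemplar sets describe letters only
        lower = ch.lower()  # uppercase forms are implicitly included
        if lower in BASIC:
            groups["basic"].add(lower)
        elif lower in AUXILIARY:
            groups["auxiliary"].add(lower)
        else:
            groups["other"].add(lower)
    return groups


print(classify_letters("A naïve résumé from Łódź"))
```

In the sample text, ï, é, and ó fall into the auxiliary group, whereas the Polish letters ł and ź are reported as "other," which a program could use as a reason to issue a warning.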

The structure of the CLDR is being developed, and the descriptions of character requirements will probably evolve quite a lot. On the other hand, even at the present stage, the CLDR constitutes the best available overall description of such matters. It should however be used with caution due to the following problems:

  • The description, with just two levels of requirements, is too coarse (see below).

  • The description only covers letters, not, for example, punctuation.

  • Not all data has been checked sufficiently carefully by authorities and experts on a language.

  • The data is insufficient, e.g., with regard to the description of auxiliary characters, which have been specified for a few languages only.

7.2.2.4. Which characters does a language need?

Questions like "Which characters does language X need?" are both very important and very difficult. It isn't even a well-defined question before you spend quite some time on it. Yet, it affects, or should affect, keyboard design and settings, font choices, input checks, text scanning, etc. Even though Unicode lets you use any characters, roughly speaking, it is still relevant to know which characters will actually be used, or needed.

People may disagree on what really belongs to a language, even at the character level. Orthographic rules on punctuation have often been defined so that it is debatable what Unicode characters are meant. For example, the rules may discuss "dash" without telling whether it is an em dash or an en dash or whether either of them could be used. There can also be dispute on whether a character difference should be made between some letters that look very similar to each other.

Instead of trying to find a one-dimensional answer, we can specify classes of characters needed in a language in a layered manner. Some characters are essential, some are auxiliary, and some are rare visitors. In a closer analysis, we might consider the following classes:


Core characters

This class includes the characters that are regarded as absolutely necessary for normal writing of the language. It roughly corresponds to the "exemplarCharacters" definition in CLDR. For English, this class contains small and capital letters "a" to "z," digits 0 to 9, some punctuation marks, and a few special characters like $ and &. The exact repertoire of punctuation marks is debatable, since we are accustomed to using, for example, the ASCII quotation mark " instead of proper quotation marks. We can often include ASCII special characters like *, due to their wide availability, even though they are not common in ordinary texts.


Commonly used other characters

These are less common characters that can be regarded as belonging to the language in the broad sense, such as é due to its occurrence in words of French origin, @ due to its appearance in the Internet context as well as in unit price indications, and the ellipsis, "…". Most of these characters can be replaced by the use of core characters, with some loss in typography and style. (For example, "e" could be used for é, and three period characters "..." could be used instead of the ellipsis "…".)


Additional characters in foreign words and names from "neighboring" languages

These are characters that belong to other languages but appear relatively often due to cultural connections. In English, it is not uncommon to use loanwords and names taken directly from French, Spanish, and German, for example. Therefore, characters like è, ñ, and ü are often needed in English texts. Their relevance depends on the nature of the text as well as cultural context. Typically, these characters are letters with diacritic marks, and the marks can usually be omitted without making the text incomprehensible, but it is regarded as good style to preserve them.


Other characters of the same script

This class differs from the preceding one on cultural and historical grounds, often with technological connections. In English, it is common to omit diacritic marks from, e.g., Polish or Czech names (writing, e.g., Łódź as Lodz), partly because such characters might not belong to ISO Latin 1, partly because they are regarded as culturally more remote than, for example, French letters.


Additional symbols

In different types of text, many additional characters other than letters are needed. The need greatly depends on the topic area. It is difficult to specify which characters might be needed in "normal" text as opposed to specialized scientific or technical usage. Their repertoire also varies over time, and in the modern world, previously unknown or rare characters like the backslash \ have become known to many people from technical contexts. We can probably include, e.g., Greek letters α and π in this class due to their use as symbols (rather than letters) in several special contexts.


Characters from other scripts

This class is the most marginal: it includes characters that are almost never used in the language, since they belong to completely different writing systems. For English and other languages written in Latin letters, this includes Cyrillic, Thai, and Chinese characters, for example. The reason is that Russian, Thai, or Chinese words are normally written as transliterated or transcribed when used in English texts. Rare exceptions appear in some linguistic and other scientific use and textbooks of foreign languages. However, the situation is somewhat asymmetric: letters of the Latin script are relatively often used in other scripts, for writing names and other notations.

7.2.3. Language Coverage of ISO Latin Alphabets

The ISO Latin alphabets are defined by ISO 8859 standards as listed in Table 7-2. There are other ISO 8859 standards, but they define character sets that contain the ASCII characters and some collections of non-Latin letters (see Chapter 3). Note that ISO Latin 5, 6, 7, 8, 9, and 10 correspond to ISO 8859-9, -10, -13, -14, -15, and -16, respectively.

The ISO Latin alphabets were primarily designed to meet the needs of some languages used as official or regional languages in Europe and written in Latin letters. Table 7-2 summarizes the suitability of ISO Latin alphabets for them. The information is mainly derived from the ISO 8859 standards. For example, the table says that Croatian can be written in ISO Latin 2 or in ISO Latin 10 (i.e., ISO-8859-16).

As a side effect, all or some of the ISO Latin alphabets cover other languages as well, such as Afrikaans, Indonesian/Malay, Swahili, and Tagalog. This issue will not be explored here.

Support for a language in some repertoire of characters is often subject to interpretation and even debate. In particular, the descriptions in the ISO 8859 standards deal with the availability of letters, not punctuation marks. Moreover, the considerations are limited to modern forms of the languages and to use "for general purpose applications in typical office environments," as ISO 8859 standards put it. To point out some other problems, some entries are marked with an asterisk *, with explanations after the table.

Table 7-2. Coverage of European languages by ISO Latin alphabets

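Language          ISO Latin                                         Notes
Albanian          1    2    -    -    5    -    -    8    9    10
Basque            1    -    3    -    5    -    -    8    9    -
Breton            1    -    -    -    5    -    -    8    9    -
Catalan           1    -    3    -    5    -    -    8    9    -
Cornish           1    -    -    -    5    -    -    8    -    -
Croatian          -    2    -    -    -    -    -    -    -    10
Czech             -    2    -    -    -    -    -    -    -    -
Danish            1    -    -    4    5    6    7    8    9    -
Dutch             1    -    -    -    5    -    -    -    9    -    ij ligature?
English           1    2    3    4    5    6    7    8    9    10
Esperanto         -    -    3    -    -    -    -    -    -    -
Estonian          -    -    -    4    -    6    7    -    9    -
Faroese           1    -    -    -    -    6    -    -    9    -
Finnish           1*   2    3    4    5*   6    7    8*   9    10   š, ž?
French            1*   -    3*   -    5*   -    -    8*   9    10   œ, Ÿ?
Frisian           1    -    -    -    5    -    -    -    9    -
Galician          1    -    3    -    5    -    -    8    9    -
German            1    2    3    4    5    6    7    8    9    10
Greenlandic       1    -    -    4    5    6    -    8    9    -
Hungarian         -    2    -    -    -    -    -    -    -    10
Icelandic         1    -    -    -    -    6    -    -    9    -
Irish             1    -    -    -    5*   6*   -    8    9*   10*  New orthography
Italian           1    -    3    -    5    -    -    8    9    10
Latin             1    2    3    4    5    6    7    8    9    10
Latvian           -    -    -    4    -    -    7    -    -    -
Lithuanian        -    -    -    4    -    6    7    -    -    -
Luxemburgish      1    -    -    -    5    -    -    8    9    -
Maltese           -    -    3    -    -    -    -    -    -    -
Manx Gaelic       -    -    -    -    -    -    -    8    -    -
Norwegian         1    -    -    4    5    6    7    8    9    -
Polish            -    2    -    -    -    -    7    -    -    10
Portuguese        1    -    -    -    5    -    -    8    9    -
Rhaeto-Romanic    1    -    -    -    5    -    -    8    9    -
Romanian          -    2*   -    -    -    -    -    -    -    10   Diacritics on s, t?
Sámi              -    -    -    4*   -    6*   -    -    -    -    Not Skolt Sámi
Scottish Gaelic   1    -    -    -    5    -    -    -    9    -
Slovak            -    2    -    -    -    -    -    -    -    -
Slovenian         -    2    -    4    -    6    7    -    -    10
Sorbian           -    2    -    -    -    -    -    -    -    -
Spanish           1    -    3    -    5    -    -    8    9    -
Swedish           1    -    -    4    5    6    7    8    9    -
Turkish           -    -    3*   -    5    -    -    -    -    -    3 deprecated
Welsh             -    -    -    -    -    -    -    8    -    -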


Explanations to Table 7-2:

  • Dutch has (arguably) an ij ligature, which does not belong to any ISO Latin alphabet.

  • Finnish official orthography contains š and ž, which are not covered by ISO Latin 1, 5, and 8.

  • French uses œ and Ÿ, which are not covered by ISO Latin 1, 3, 5, and 8.

  • Romanian uses letters "s" and "t" with a diacritic mark below them. According to the Romanian Standards Institute, this diacritic mark is not a cedilla but a comma below. According to this interpretation, no ISO Latin alphabet except ISO Latin 10 is suitable for Romanian. However, according to ISO 8859-2, Latin alphabet No. 2 can be used "subject to the agreement of originator and receiver in information exchange." Effectively, "s" and "t" with cedilla (ş, ţ) can be used as substitutes.

  • Turkish can be written in ISO Latin 3 and ISO Latin 5, but the use of ISO Latin 3 for Turkish is deprecated.
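Questions about individual letters, such as œ or the Romanian letters with comma below, can be cross-checked against the code tables that ship with a programming language. The following Python sketch maps the ISO Latin alphabet numbers to Python's ISO 8859 codec names, following the correspondence given at the start of this section. The function name latin_alphabets_covering is an assumption made for this example, and the results reflect Python's codec tables only, not the statements of the ISO 8859 standards about language support.

```python
# ISO Latin alphabet number -> Python codec name for the corresponding
# part of ISO 8859 (Latin 5 is ISO 8859-9, Latin 6 is ISO 8859-10, etc.).
LATIN_CODECS = {
    1: "iso8859_1", 2: "iso8859_2", 3: "iso8859_3", 4: "iso8859_4",
    5: "iso8859_9", 6: "iso8859_10", 7: "iso8859_13", 8: "iso8859_14",
    9: "iso8859_15", 10: "iso8859_16",
}


def latin_alphabets_covering(characters):
    """Return the ISO Latin alphabet numbers that contain all given characters."""
    covering = []
    for number, codec in sorted(LATIN_CODECS.items()):
        try:
            characters.encode(codec)
        except UnicodeEncodeError:
            continue  # some character is missing from this alphabet
        covering.append(number)
    return covering


print(latin_alphabets_covering("\u0153\u0152"))   # oe ligature, small and capital
print(latin_alphabets_covering("\u0161\u017e"))   # s and z with caron
print(latin_alphabets_covering("\u0219\u021b"))   # s and t with comma below
print(latin_alphabets_covering("\u015f\u0163"))   # s and t with cedilla
```

Such a check only tells whether specific characters are present in an alphabet. Whether that makes the alphabet suitable for a language is still a matter of interpretation, as discussed above.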

We will next consider in more detail the character requirements of two languages, Spanish and French. They are rather similar in their writing systems and use of characters, compared with the variation among the world's languages. Yet, problems emerge in the details.

7.2.4. Example: Spanish

The basic character requirements of Spanish include (in addition to ASCII characters):

  • Accented characters á, é, í, ó, and ú (and their uppercase forms)

  • The letter ü (and Ü)

  • The letter ñ (and Ñ)

  • Inverted exclamation mark ¡, used at the start of an exclamation

  • Inverted question mark ¿, used at the start of a question

  • Characters ª and º, used when an ordinal number has been written with digits (e.g., 2ª = segunda "second (feminine)" and 2º = segundo "second (masculine)")

  • Em dash (U+2014)

  • Quotation marks: double angle quotation marks («bien»), double quotation marks as in English (“bien”), and single quotation marks as in English (‘bien’), with some differences in usage

Except for the dash and the curly quotation marks, Spanish is covered by the ISO Latin 1 character repertoire, and the Windows Latin 1 repertoire adds the missing characters, as well as the euro sign, €. For the purposes of writing Spanish, ISO Latin 9 (ISO 8859-15) is the same as ISO Latin 1 with the addition of the euro sign, but ISO Latin 9 is little used. Some other ISO Latin alphabets could be used for Spanish, too, but most of them lack the inverted exclamation mark and the inverted question mark.

Spanish also uses ellipsis points, "puntos suspensivos," but they are usually unspaced, unlike in recommended English practice. Therefore, they can be represented as sequences of three periods (U+002E U+002E U+002E) rather than as the horizontal ellipsis character (U+2026).

MS Word helps in writing Spanish, if it has recognized the language from the text or you tell it via Word commands. In Spanish mode, MS Word does not convert three periods to English-style ellipsis as it would otherwise do. It also changes an ! or ? at the start of a sentence to an inverted exclamation or question mark, and it changes, for example, "2a" to "2ª." Somewhat strangely, MS Word produces English-style quotation marks (“bien”) in Spanish mode, even though Spanish literary usage favors guillemets («bien»).

In Spanish, the acute accent indicates the vowel as stressed, and this may imply a difference in meanings of words. However, in names, the accent rarely has a distinctive meaning. Accented letters are not counted as separate letters in the alphabet, and the accent is taken into account in alphabetic ordering at the secondary level only (i.e., for words that are otherwise the same). By the official rules, accents are used in uppercase letters, too, although it is not rare to deviate from this.

Traditionally, the combinations (digraphs) "ch" and "ll" (which denote specific phonemes in Spanish) have been regarded as separate letters, placed in the alphabet between "c" and "d," and between "l" and "m," respectively. However, in 1994, the association of Spanish language academies decided to accept the treatment of these combinations as pairs of letters in alphabetic ordering. Previously Spanish had, for example, "correo" < "chico," since "c" < "ch," but now the official sorting rules follow the international pattern.
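The difference between the two orderings can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not an implementation of the official sorting rules: the names MODERN, TRADITIONAL, base_letters, and spanish_sort_key are invented for this example, accents are ignored at the first level and used only to break ties, and anything outside the Spanish alphabet is simply sorted last.

```python
import unicodedata

MODERN = list("abcdefghijklmn\u00f1opqrstuvwxyz")
# Traditional alphabet: "ch" follows "c" and "ll" follows "l".
TRADITIONAL = (list("abc") + ["ch"] + list("defghijkl") + ["ll"]
               + list("mn\u00f1opqrstuvwxyz"))


def base_letters(word):
    """Lowercase the word and drop accents and diaereses, but keep the letter ñ."""
    result = []
    for ch in unicodedata.normalize("NFC", word.lower()):
        if ch == "\u00f1":
            result.append(ch)
        else:
            result.append(unicodedata.normalize("NFD", ch)[0])
    return "".join(result)


def spanish_sort_key(word, traditional=False):
    """Sort key: letter ranks first, the accented spelling only as a tie-breaker."""
    alphabet = TRADITIONAL if traditional else MODERN
    base = base_letters(word)
    ranks = []
    i = 0
    while i < len(base):
        pair = base[i:i + 2]
        if traditional and pair in ("ch", "ll"):
            letter = pair       # treat the digraph as one letter
        else:
            letter = base[i]
        i += len(letter)
        # Characters outside the alphabet sort after all letters.
        ranks.append(alphabet.index(letter) if letter in alphabet else len(alphabet))
    return (ranks, unicodedata.normalize("NFC", word.lower()))


words = ["chico", "dama", "correo", "luz", "llama"]
print(sorted(words, key=spanish_sort_key))
print(sorted(words, key=lambda w: spanish_sort_key(w, traditional=True)))
```

With the modern key, "chico" precedes "correo" and "llama" precedes "luz"; with the traditional key, both pairs are reversed. Production code would normally rely on a collation library with locale data instead of a hand-written key.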

The details of official Spanish orthography can be found in the document "Ortografía de la lengua española," available online via http://www.rae.es/.

7.2.5. Example: French

The basic character requirements of French include (in addition to ASCII characters):

  • Several vowels with diacritic marks: à, â, é, è, ê, ë, î, ï, ô, ù, û, ü, ÿ (and their uppercase forms)

  • The letter ç (and Ç)

  • The letter œ (and Œ)

  • Debatably, the letter æ (and Æ), in words of Latin or Greek origin (e.g., "ægosome")

  • Em dash (U+2014)

  • Quotation marks: double angle quotation marks (« bien »), double quotation marks as in English (“bien”), and single quotation marks as in English (‘bien’), with some differences in usage

Except for the character œ, the dash, and the curly quotation marks, French is covered by the ISO Latin 1 character repertoire, and the Windows Latin 1 repertoire adds the missing characters, as well as the euro sign, €.

However, there is an essential feature in French orthography that cannot be properly addressed in plain text even using Unicode. The orthography rules require thin space (espace fine) after or before some punctuation marks, e.g., before an exclamation mark. Naturally, such a space should be nonbreaking. This problem is discussed in the section "General Punctuation" in Chapter 8.

The letter œ, "oe" ligature, has often been written as the character pair "oe" due to character code limitations. The letter œ was one of the reasons for defining the ISO-8859-15 code (ISO Latin 9), which has not gained much popularity, since you can use œ in windows-1252, and naturally in any Unicode encoding. A normal French keyboard still has no key for œ, so some special technique is needed to type it.

There is no simple way to type æ either on a French keyboard. In practice, "ae" is very often used instead, although the dictionary of the French Academy uses æ spellings.

MS Word helps in writing French, for example, by turning, in French mode, the ASCII quotation mark to French-style quotation marks (e.g., turning the input "bien" into « bien »). Like Spanish, French uses unspaced periods for ellipsis.

Diacritic marks are essential in French, and should be used on uppercase letters, too, according to the recommendation of the French Academy. However, there are differences of opinion and expectations in this area. Therefore, programs often contain a user-settable option for allowing or disallowing accents on capital letters in French text. When they are disallowed, conversion of "égalité" to uppercase would produce "EGALITE." In MS Word, the setting is Tools → Options → Edit → Allow accented uppercase in French. The default setting for this may depend on the version of French (e.g., so that it is normally off for the French of France, but on for Canadian French).
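An option of this kind can be sketched in a few lines of Python. The function name french_upper and its allow_accents parameter are assumptions made for this illustration, and the sketch does not claim to reproduce what MS Word does. In particular, it removes every combining mark, so it also turns Ç into C, which may or may not be what a given program means by disallowing accents.

```python
import unicodedata


def french_upper(text, allow_accents=True):
    """Convert text to uppercase, optionally removing diacritic marks."""
    upper = unicodedata.normalize("NFC", text).upper()
    if allow_accents:
        return upper
    # Decompose, drop the combining marks, and recompose what is left.
    decomposed = unicodedata.normalize("NFD", upper)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


print(french_upper("\u00e9galit\u00e9"))                       # accents kept
print(french_upper("\u00e9galit\u00e9", allow_accents=False))  # accents removed
```

The ligature œ is not affected by the removal of marks, since it has no decomposition; it is simply converted to Œ.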

On a typical French keyboard ("AZERTY keyboard"), the methods for typing letters with a diacritic mark are different for different characters. In particular, there is no obvious way to enter the capital letters É and Ç, so the user needs to know and to use some special technique (such as Alt-0201 and Alt-0199, or Ctrl-' E in Word). The letter œ (or Œ) cannot be typed in any obvious way either.

Since French uses several diacritic marks, it's easier to get them wrong than in Spanish. For example, "e" with grave, è, and "e" with acute, é, are often confused with each other by foreigners, even though their main purpose is to indicate a difference in pronunciation. When spellchecking is enabled and French is supported in it, such confusion will almost always be detected.

There was a large reform of the use of diacritic marks in French in the 1990s. Generally, their use was reduced. Old texts and even old programs (e.g., spellcheckers) might still reflect the old rules. The new rules are described in the document "Rectification de l'orthographe," http://www.academiefrancaise.fr/langue/orthographe/plan.html.



Unicode Explained
ISBN: 059610121X
EAN: 2147483647
Year: 2006
Pages: 139