7.1. Writing Systems and IT

In information technology, we often deal with text just as any data, with no regard to its internal structure or meaning. When sending a plain text file, for example, we consider at most issues like efficiency, encoding, and checking that the data arrives unchanged. However, operations like page layout, searching, indexing, and word processing need to be sensitive at least to some features and variation of writing systems.

7.1.1. Internationalization (i18n) and Related Issues

Character code problems are part of a topic called internationalization, jocularly abbreviated as i18n, where 18 stands for the 18 letters between "i" and "n" in this difficult word. It is not really a matter of being international but rather a matter of letting people use their national languages and notations. Typically, international communication on the Internet is carried out in English, but "internationalization" is meant to create realistic possibilities for communication in any language.

Internationalization mainly revolves around the problems of using various languages and writing systems (scripts). It includes questions like text directionality, which was discussed in Chapter 5. This book discusses mostly just the character-level aspects of internationalization.

Internationalization is related to localization, sometimes abbreviated as l10n. Localization means that data and systems are adapted to specific linguistic, cultural, and local habits and rules, collectively called a locale. In the modern approach, localization is usually based on internationalization. It is often much better to start from a neutral basis and develop mappings to different locales than to map from a specific locale to another.

The word globalization is used to denote the general idea of making things work globally as well as different practical methods and aspects. Quite often, this means internationalization followed by localization. However, it can also mean things like supporting different repertoires of characters, for any use whatsoever. The terms are often used interchangeably, or vaguely, but perhaps a useful division is the following:

  • Internationalization turns the internal representation of data into a neutral, easily processable and well-defined format. For example, for processing monetary data, we aim at using an internal format that always identifies the currency but does not fix the way in which such data is displayed.

  • Localization implements the presentation of data to users in a manner that adapts to their expectations and preferences. A sum of money stored in an internationalized format as the number 42.5 and the currency code USD (U.S. dollar) might be presented as "$42.50" to a U.S. user and as "42:50 $" to a Swedish user.

  • Globalization is an umbrella term that covers internationalization, localization, and other ways of making data presentation and processing truly global, so that different languages, notations, and conventions can be used.
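The division between the first two items can be illustrated with a minimal Python sketch. The `Money` class, the `present` function, and the small rule tables are hypothetical names invented for this illustration; real software would normally take presentation rules from locale data rather than from a hand-written table.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Money:
    """Locale-neutral internal form: exact amount plus currency code."""
    amount: Decimal
    currency: str  # e.g. "USD"


# Hypothetical presentation rules, for illustration only. They mirror the
# two presentations mentioned in the text and nothing more.
PRESENTATION = {
    "en-US": {"decimal": ".", "pattern": "{symbol}{number}"},
    "sv-SE": {"decimal": ":", "pattern": "{number} {symbol}"},
}
SYMBOLS = {"USD": "$"}


def present(money: Money, locale_id: str) -> str:
    """Localization step: render the neutral value for one locale."""
    rules = PRESENTATION[locale_id]
    whole, fraction = f"{money.amount:.2f}".split(".")
    number = whole + rules["decimal"] + fraction
    symbol = SYMBOLS.get(money.currency, money.currency)
    return rules["pattern"].format(symbol=symbol, number=number)


price = Money(Decimal("42.5"), "USD")
print(present(price, "en-US"))  # $42.50
print(present(price, "sv-SE"))  # 42:50 $
```

The stored value never changes; only the presentation step depends on the locale.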

Note that most people and most documents probably use the word "internationalization" in a broad sense that roughly corresponds to our definition of "globalization." Sometimes "globalization" is used as a very specific term to refer to software that has been internationalized and that supports localization at runtime, i.e., switching between locales without restarting the program.

7.1.2. Aspects of Writing and Their IT Impact

In information technology, we usually do not need to know about the sound values of letters and other symbols. Obvious exceptions to this include language processing such as automatic speech synthesis or loose comparison of strings by their phonetic similarity (e.g., in search systems). Similarly, the meanings of words formed from characters are irrelevant to most data processing applications. There are, however, somewhat more technical aspects of writing that can be significant.

7.1.2.1. Writing direction

In normal text processing, some basic features of the writing system used in the text are significant. The problem of left-to-right versus right-to-left writing was discussed in the section "Directionality" in Chapter 5. The writing direction affects text rendering in many ways, though many people do not realize this, since they have always used left-to-right writing only.

Vertical writing means writing text in lines that run vertically from top to bottom, or sometimes from bottom to top. Whether such vertical lines (i.e., columns) run right to left or left to right is a different issue. East Asian writing has traditionally been vertical, but horizontal writing is now used, too, partly because many computer systems have been unable to produce vertical layout. Another reason is that it makes it easier to insert text (such as names and formulas) in Latin letters into a document.

Vertical writing as such is handled outside Unicode and above the character level in general, using layout tools that produce it. However, the possibility of writing vertically has some impact. The shape of some Japanese punctuation marks is different in vertical writing; for example, the colon, :, is rotated 90 degrees. This should be handled by the rendering software as a glyph selection issue. However, there are some variants of such characters for vertical text, vertical forms, in the CJK Compatibility Forms block. Moreover, there are halfwidth and fullwidth variants of ASCII characters, for use in vertical writing, which in practice requires characters to be of fixed width. This width is either the width of a display cell (square) or half of it.
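The compatibility variants mentioned above can be inspected with Python's standard `unicodedata` module. The following is only an exploratory sketch. The sample characters were chosen for this illustration, and `describe` is a hypothetical helper name.

```python
import unicodedata

# Sample characters chosen for this illustration.
samples = [
    "A",        # ordinary ASCII letter
    "\uFF21",   # fullwidth variant of "A"
    "\uFF71",   # a halfwidth katakana letter
    "\uFE35",   # vertical form of "(" in the CJK Compatibility Forms block
]


def describe(ch):
    """Return (name, East Asian width class, compatibility-normalized form)."""
    return (
        unicodedata.name(ch),
        unicodedata.east_asian_width(ch),   # "F" = fullwidth, "H" = halfwidth
        unicodedata.normalize("NFKC", ch),  # maps variants to ordinary characters
    )


for ch in samples:
    name, width, plain = describe(ch)
    print(f"U+{ord(ch):04X} {name}: width class {width}, NFKC form {plain!r}")
```

Compatibility normalization (NFKC) maps the fullwidth letter and the vertical parenthesis back to the ordinary characters, which reflects the fact that they are presentation variants rather than distinct characters.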

7.1.2.2. What does a language setting really set?

The language of text is crucial for many data processing tasks, though much of processing is completely independent of language. The effect of languages has been greatly obscured by software and documents that mix quite separate concepts with each other: writing system, language, character repertoire, character encoding, keyboard layout, etc. These are interrelated but fundamentally different things. In particular, it is crucial to distinguish between the following language settings:

  • The language of a program's user interface, affecting menus, error messages, etc.

  • Keyboard settings, which have usually been designed for some particular language and named according to it (e.g., "French keyboard")

  • The language of a document being written, viewed, or otherwise processed, perhaps with variation inside a document (since it may contain texts in several languages)

  • The user's preferred language for accessing some content, in situations where a document is available in several languages

These are all logically independent of each other, and of character encoding as well as of fonts.

The user interface language is often fixed by the program designer, according to the estimated user community. Many programs are available as different language versions, and, in some cases, you might even be able to buy a multilingual version, where the language can be changed on the fly, or at least between sessions with the program.

In Chapter 2, we discussed how the different needs of different languages could be taken into account in keyboard design, especially when using virtual keyboards. The current keyboard setting is often displayed at the bottom of the screen, using language codes like "EN" for English, etc. However, such settings really relate to the keyboard only. I am writing this with the keyboard set to "FI" (Finnish), even though I am writing in English and have the language set to English in the word processor. The reason is that I want my keyboard keys to work the way that the keycaps suggest. The user interface language of the word processor (e.g., the language of commands like "File," "Edit," etc.) is yet another thing. Finally, if I visit a web page, I might have set my browser to ask primarily for a German version of a page, if available, if my native language were German.

We will next discuss the two other meanings of "language settings" by simple examples.

7.1.3. Setting the Language in Word Processing

Advanced word processors typically support more than one language, and they need to know or to guess the language of the text. The support might include:

  • Automatic operations on punctuation to match the rules of the language

  • Hyphenation and language-sensitive line breaking in general

  • Spellchecking (while typing, or upon specific request)

  • Grammar checks

  • Hints on synonyms for a word upon request

  • Translation tools of varying kinds, e.g., showing translations for a word upon request

When you acquire a word processor or other text-related software, it is important to consider not only the user interface language but also the language support you will need. However, you might be able to buy extra modules later, extending the program with support for new languages.

7.1.3.1. Automatic operations on punctuation

As an example, if you type the data "foo" in MS Word, with suitable language packs installed if needed, you will see and your document will actually contain:

  • “foo” if the document language is set to English

  • « foo » if the document language is set to French

  • „foo“ if the document language is set to German

  • "foo" if the document language is set to Danish

This means that you can use an ordinary keyboard with just one key for a quotation mark, since the program converts it to language-specific characters. There will be some other examples on fixing punctuation by language-specific rules later in this chapter.
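The kind of conversion described above can be sketched in a few lines of Python. This is not how MS Word implements it. The `QUOTES` table and the `convert_quotes` function are hypothetical, and the table covers only conventions that are well established (English, French, German).

```python
# Hypothetical table of quotation-mark conventions per document language.
QUOTES = {
    "en": ("\u201C", "\u201D"),              # “foo”
    "fr": ("\u00AB\u00A0", "\u00A0\u00BB"),  # « foo » with no-break spaces
    "de": ("\u201E", "\u201C"),              # „foo“
}


def convert_quotes(text, language):
    """Replace each ASCII quotation mark by an opening or closing mark.

    Simplifying assumption: quotation marks strictly alternate between
    opening and closing, so nesting and apostrophes are not handled.
    """
    opening, closing = QUOTES[language]
    result = []
    is_open = False
    for ch in text:
        if ch == '"':
            result.append(closing if is_open else opening)
            is_open = not is_open
        else:
            result.append(ch)
    return "".join(result)


for language in ("en", "fr", "de"):
    print(language, convert_quotes('"foo"', language))
```

Because the replacement depends on the language in effect at the moment of conversion, text converted under one setting keeps its marks when the setting is changed later.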

This is just fine when it works right. However, several things can go wrong. If the word processor has a wrong idea of the language of the text, it will not perform the conversion at all, or it will perform a wrong conversion, which is even worse. When editors combine texts from different authors and sources, they might fail to check such things. As a result, a publication might contain a mixture of styles (like “foo” and « foo » and „foo“). Unfortunately, there is often no simple way to fix such things, since the conversions take place when typing; changing the language for already typed text does not change its punctuation.

On the other hand, sometimes a conversion, although correct for the language used in the text in general, is not correct in some specific occasion. Your English text might contain a block quotation in French, and inside it, French punctuation should be used. (Whether quoted text should preserve its original punctuation is a matter of style and rules. The point here is that situations exist where people wish to preserve it.)

Sometimes a conversion of quotation marks is not desirable at all. You may need to use ASCII quotation marks, since you are writing about a computer language. In that case, you can use Ctrl-Z immediately after typing a quotation mark that was converted by MS Word. The reason is that such operations undo the automatic replacement. Thus, to produce "foo" with straight quotes, you would type "^Zfoo"^Z where ^Z denotes pressing Ctrl-Z. Alternatively, you could change the MS Word settings to disable any automatic replacement of quotation marks.

7.1.3.2. Spelling and grammar checks

Word processors and other text-oriented software often contain automatic tools for spellchecking, perhaps even for grammar and style checks. A spellchecker typically detects misspelled words and may suggest corrections. A grammar or style check operates on constructs larger than a word, and it is based on some linguistic analysis of sentences. A grammar check could detect, for example, the lack of a predicate verb in a sentence.

Opinions on the usefulness of such checks vary greatly, and so does the quality of checkers. When writing specialized text with many special terms and rare words, a spellchecker typically flags a large number of words as potentially misspelled. It may also suggest alternatives to such words, often letting the user fix his error easily, but sometimes presenting something absurd.

When writing for a wide audience, spellchecking is a very good idea. If a spellchecker does not recognize some special word that you use, odds are that many readers won't either.

When you set the language of text in a word processor, the effect depends on the extent of support for that language in the program. Perhaps the program simply records the information about language without using it in any way. It might still pass the information forward when the text is transferred to another program. Moreover, other versions of the program might use the information in a useful way. Support for a language might consist of some simple operations on punctuation marks, as described earlier. It might also include a spellchecker, grammar checker, style checker, readability checker, synonym dictionary, etc.

If you set the language and see something useful happening (e.g., quotation marks turning to chevrons when the language has been set to French), the program might still fail to do any spellchecks, even if you have enabled checking in general. The software might lack a spelling dictionary and other spelling support for a language. An easy way to check this is to write something nonsensical, like qffqgfq, and see whether the program flags it as an error.

7.1.3.3. Determining the language of text

A word processor could deduce the language of a document or a fragment of a document in different ways. In particular, MS Word uses the following techniques:


Heuristic recognition

MS Word analyzes the text and deduces the language by statistical analysis. This feature can be disabled, though. When it is enabled, you can start typing text, and after a few words, MS Word probably guesses the appropriate language and switches to it. You may observe that words indicated first as misspelled or suspicious with a red wavy underline turn into normal words.


Explicit information from user

As a user, you can click on the language indicator text at the bottom of the MS Word window (e.g., the word "English" there). This opens a small window as in Figure 7-1, and there, you can select a language. This will apply to text you will type, until the language setting is changed. If you have first selected some text (e.g., by double-clicking or painting), only that fragment of text will be affected. Thus, if you have typed some text in English, and then noted that MS Word flags a name like Rhône as potentially misspelled, you can select the word by double-clicking on it and set the language to French, for that word only. (You can also right-click after the selection, to get a pop-up menu with language settings as one of the available functions.)


Embedded information

If you open an existing MS Word document, it contains language information corresponding to what was deduced or expressed when writing it. MS Word will read and use that information. Similar things may happen with some other document formats as well, e.g., when opening an HTML document in MS Word.
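To give an idea of what heuristic recognition by statistical analysis can mean, here is a deliberately naive Python sketch that counts very common words of each language. It is not the method used by MS Word, which is not documented here. The word lists and the `guess_language` function are assumptions made for this illustration.

```python
import re

# Tiny, hypothetical lists of very frequent words; a real recognizer would
# use much larger statistics, such as letter-sequence frequencies.
COMMON_WORDS = {
    "English": {"the", "and", "of", "to", "is", "in", "that"},
    "French": {"le", "la", "les", "et", "de", "est", "un", "une"},
    "German": {"der", "die", "das", "und", "ist", "nicht", "ein"},
}


def guess_language(text):
    """Return the language whose common words occur most often, or None."""
    words = re.findall(r"\w+", text.lower())
    scores = {
        language: sum(1 for word in words if word in vocabulary)
        for language, vocabulary in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("The name of the river is unknown"))
print(guess_language("Der Hund und die Katze ist nicht hier"))
print(guess_language("qffqgfq"))
```

A guess becomes possible only after a few words have been typed, which matches the behavior described above: with too little evidence, the sketch returns `None`.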

7.1.3.4. Exercise

This exercise requires MS Word or some other word processor with some support for different languages. You also need to know some basic functions in it, or to consult a manual to learn about them. With these premises, this exercise may illustrate the benefits of indicating the language:

  1. Open some small document in a word processor.

  2. Select all text in the document (e.g., with Ctrl-A in MS Word) and perform a spellcheck on it.

  3. Set the word processor to check spelling when typing.

  4. Then add some long word in another language supported by the program. Insert the word in several places. You should now see the word indicated as misspelled.

    Figure 7-1. Setting the language of text in MS Word (the style and content of this window depend on the version of MS Word and previous use of languages in a document)

  5. Set the program to use justification on both sides and word division as needed. You should now see the long word incorrectly divided, or left undivided. (If this does not happen, add it to suitable places.)

  6. Click on one of the occurrences of the long foreign word and set its language to the correct one. You should now see the misspelling indication vanish and the word split correctly, provided of course that its language is sufficiently well supported by the word processor.

This paragraph illustrates the topic of the exercise. It contains the longish word Haupteigenschaft. If a word processor does not treat it as a German word, it probably leaves the word undivided, often causing poor formatting (too much or too little spacing between words), or divides it improperly. The proper division points are as in Haupt-ei-gen-schaft. When the word processor knows the language, the writer need not know the hyphenation rules of that language, except perhaps to fix the hyphenation of some special words.
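The dependence of hyphenation on language information can be sketched as follows. The `HYPHENATION` table and both functions are hypothetical and contain just the one word discussed above. Real hyphenators typically work from language-specific rules or patterns rather than from a word list. The sketch marks the permitted division points with the soft hyphen character (U+00AD).

```python
SOFT_HYPHEN = "\u00AD"  # invisible unless a line is actually broken there

# Hypothetical per-language hyphenation data, for illustration only.
HYPHENATION = {
    "de": {"Haupteigenschaft": "Haupt-ei-gen-schaft"},
}


def add_soft_hyphens(word, language):
    """Insert soft hyphens at the division points known for the language.

    If the language or the word is not supported, the word is returned
    unchanged, so it can only be left undivided.
    """
    pattern = HYPHENATION.get(language, {}).get(word)
    if pattern is None:
        return word
    return pattern.replace("-", SOFT_HYPHEN)


def division_parts(word):
    """Split a word at its soft hyphens to show the possible break points."""
    return word.split(SOFT_HYPHEN)


print(division_parts(add_soft_hyphens("Haupteigenschaft", "de")))
print(division_parts(add_soft_hyphens("Haupteigenschaft", "en")))
```

With the language set to German, the word has three possible division points; with any other setting, the sketch has no information and leaves the word undivided.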

7.1.4. Setting Language Preferences in Browsers

We will briefly discuss the language settings in web browsers. Although they are usually not very important (they relate to "language negotiation" described in Chapter 10), they have caused some confusion that needs to be cleared up. In particular, they have been confused with other, more important language settings.

A dialog for setting language preferences in Mozilla Firefox can be invoked with the command Tools → Options → General → Languages, and the dialog window is shown in Figure 7-2. In IE 6, you would enter a similar dialog by selecting Tools → Internet Options → General → Languages → Language Preferences. As we mentioned in Chapter 1, these preferences are typically coupled with the setting of the default encoding (to be implied for pages that do not specify their encoding), which is something quite different.

Figure 7-2. Setting language preferences in Firefox

The settings may include one or more languages, in order of preference. In the dialog, the user can typically add (or remove) languages and move them up and down in the order. Ideally, the user should list all languages she understands to some extent at least. Such settings are sent by the browser when it sends a request to a web server. The server may then use the information to select a particular language version of the requested page. Examples of this include http://www.debian.org/ and http://www.altavista.com. However, this is rare, and most bilingual or multilingual sites do not use such technology but typically just explicit language versions.
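In HTTP, the browser sends these preferences in the `Accept-Language` request header, for example `de, en;q=0.8, fi;q=0.5`, where the optional `q` values express the order of preference. The following Python sketch shows how a server might use the header. The function names and the simple fallback rule are assumptions made for this illustration, not a description of any particular server.

```python
def parse_accept_language(header):
    """Return the language tags of the header, most preferred first."""
    preferences = []
    for position, item in enumerate(header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        tag = parts[0].lower()
        if not tag:
            continue
        quality = 1.0  # default when no q value is given
        for parameter in parts[1:]:
            if parameter.startswith("q="):
                try:
                    quality = float(parameter[2:])
                except ValueError:
                    quality = 0.0
        if quality > 0:  # q=0 means "not acceptable"
            preferences.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(preferences)]


def choose_version(header, available, default):
    """Pick the first preferred language for which a version exists."""
    for tag in parse_accept_language(header):
        if tag in available:
            return tag
        primary = tag.split("-")[0]  # e.g. "de-AT" falls back to "de"
        if primary in available:
            return primary
    return default


print(parse_accept_language("de, en;q=0.8, fi;q=0.5"))
print(choose_version("de-AT, en;q=0.8", {"en", "de"}, "en"))
```

If none of the listed languages is available, the server has to fall back on some default version, which is one reason why users should list all the languages they can read.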

The language preferences in browsers have no effect except when a web page is available in several languages, using a particular protocol.


For completeness, we need to mention, though, that Netscape and Mozilla software may include information about the user's language preferences in message headers when such software is used to post an article to Usenet. This is in principle a threat to privacy.

7.1.5. Script = Writing System

The word "script" is often used instead of "writing system," and we follow suit in this book, even though some confusion is possible. To many people, "script" means a (small) program or a command file, which is very different from a writing system for human languages. Here "script" means basically a collection of letters and other characters, meant for writing human languages in a systematic way.

A script, as a writing system, is not an exact concept but a matter of judgment and convention. We say that languages such as English, German, Icelandic, and Vietnamese use the Latin script, although they have different repertoires of characters. German has, in addition to the basic Latin letters "a" to "z," letters like ä. Icelandic has accented letters like á and the extra letters ð and þ, which are regarded as Latin letters by convention. Vietnamese uses multiple diacritics, although they are often dropped due to technical limitations or ignorance.

Thus, "Latin script" is a broad concept. It contains many more characters than most people imagine. What is common is the historical basis, the letters used in writing classical Latin. Different diacritic marks and even completely new characters have been added, to deal with sounds that cannot be conveniently expressed using the basic Latin letters. The reason why the Icelandic ð and þ are counted as Latin letters is not in their shape but their use in a language that uses letters "a" to "z" as the basis of the alphabet. The Latin script also contains, by convention, a large set of phonetic (IPA) characters, although some of them have been rather directly derived from Greek letters, such as Latin small letter gamma ɣ (U+0263).

Other scripts include Greek, Cyrillic, Arabic, Hebrew, Hangul (Korean), and Han (Chinese) script. Although many scripts have common ancestors (in fact, the scripts used by mankind can be traced back to just a few different original scripts), they may have diverged considerably. The Greek and Cyrillic scripts, for example, resemble the Latin script quite a lot, but there are so many changes in the alphabet as a whole that they are classified as separate scripts. For information on the nature and use of different scripts, consult the web site http://www.omniglot.com/writing/.

Many languages use and have always used a particular script. For some languages, the script has been changed to another in the course of time. Turkish was once written in the Arabic script, now in the Latin script. Some languages have changed script several times, often for political reasons. Since changes often take time, a language might have two scripts in use at the same time, and such a situation might become even relatively permanent.

7.1.5.1. Categories of Scripts

In the section "Variation of Writing Systems" in Chapter 1, we described some basic categories of scripts: alphabetic, consonant, syllabic, and ideographic. The differences between these categories are more difficult to handle in automatic processing than the variation of character repertoires. For example, Greek text is displayed basically the same way as English: you put one character after another, left to right, with lines running top down, and breaking lines between words, unless you have some hyphenation routine. Displaying Arabic, on the other hand, requires writing right to left and selecting the shape of a character according to its position in a word. Many data-processing programs and systems have been designed with the implicit assumption that everything is written pretty much the same way as English, although perhaps with some other letters.

7.1.5.2. Need for script information

In some contexts, it is useful to be able to specify the script used in a document or part of a document in a manner suitable for automatic processing. Moreover, most characters can be classified as belonging to one script only. For example, suppose that a document has been specified to be in the Latin script, or has been inferred to be in the Latin script by an analysis of its content. If the document contains an isolated Cyrillic letter, this could be an error (e.g., a user has entered a Cyrillic "A" by mistake), and in any case, it is something special that may need human attention.

Script information can also be used in pattern matching. For example, you might wish to use a pattern that corresponds to any sequence of characters in the Cyrillic script. In practice, patterns should normally also include the script name "Common," which refers to characters that appear in several scripts. Script information can be specified at different levels:


Document

The script of a document can be expressed informally, in prose (e.g., "this document contains old Turkish, written in the Arabic script"), or it can be guessed from the context, language, or even encoding. In the future, the script can also be specified formally as part of the language code specified for the document.


Fragment of a document

This could be a section, a paragraph, a sentence, or even an individual word, or other part of a document. For example, a scholarly work could be written in English but with Greek quotations in Greek letters. You might be able to use markup or out-of-band information to indicate the script of a fragment, e.g., as part of a language code.


Character

This level is covered well in the Unicode standard. As we can see, the standard assigns each character a script.

Although many blocks in Unicode contain characters from one script, and might have been named according to a script, there is no one-to-one correspondence between blocks and scripts. Some blocks contain characters from different scripts, and some scripts have been divided into several blocks (e.g., Basic Latin, Latin-1 Supplement, Latin Extended-A, etc.). Therefore, the Unicode standard defines a separate property that specifies the script of a character, Script (sc).

7.1.5.3. Scripts and spoofing

Script information has become more important because characters from different scripts are being mixed in order to mislead people by "spoofing." The idea in the kind of spoofing discussed here is to present text to the user in a format that looks correct but internally means something different. Spoofing is possible even within one script. The familiar example is the use of "l" (lowercase letter "l") instead of "1" (digit one), or vice versa, making use of the fact that in many fonts, they are hard to distinguish. Another old example is the confusion between "O" (capital letter "o") and "0" (digit zero), although they are rather different in most modern fonts, when you see both of them.

Spoofing is a relatively modern phenomenon, since it revolves around the difference between visible shapes of characters and their internal digital representation. In the old times, it did not matter much if you typed "O" for "0" in a number, since the character you entered existed only on paper and was judged only on its appearance. In fact, some old typewriters forced people to type that way, since they lacked digits "0" and "1" altogether. In the modern world, it matters a lot whether an address, a password, or a variable name contains the letter "l" or the digit "1," since they have completely separate internal representations.

Spoofing might be accidental: people make mistakes in typing and confuse characters with each other. Spoofing might also be used with good aims: some instructions on choosing good passwords suggest that you spoof (e.g., use "l" in place of "1") to make it more difficult to steal your password from a casual glimpse of it or to crack it with dictionary attacks.

For the most part, spoofing is used in attempts to break into systems or otherwise compromise their security. Perhaps the best known form of spoofing is to use Internet domain names that misleadingly resemble another. If there is a widely known web server at www.paypal.example, an attacker might set up www.paypa1.example and send, say, a million copies of an email message asking people to log in at the following site: http://www.paypa1.example. They are then asked to change their password, to protect their account against some threat. The attacker would have set up a server that looks and acts like the real service being imitated but actually steals the user ID and password given on login. Such operations have often succeeded even when they rely on something as simple as the similarity of "l" and "1" in many fonts.

The particular form of spoofing that is used to mislead people into logging in somewhere and giving their confidential information is called "phishing." Users could resist such attacks by refusing to click on addresses shown in email messages, but many people are careless and lazy. It's so much easier to click (or cut and paste) than to type.

Unicode, with its large repertoire of characters, has opened new possibilities for spoofing. This is relevant in cases where national characters are used in Internet domain names. (Their use in web addresses otherwise might be relevant, too, but usually it's the domain name part, the server name, that is crucial in spoofing.) If you were able to distinguish "paypa1" from "paypal," perhaps because you were using a font that makes the difference obvious, how about "pypl? This string actually contains two occurrences of the Cyrillic small letter "a" (U+0430). It is highly unlikely that you would be able to distinguish them from the Latin small letter "a" by their appearance only, since in practically all fonts, they look exactly the same.

Proposed solutions include the display of URLs or strings in general in a manner that highlights any abnormal changes of script, e.g., by bolding any Cyrillic letter that appears between Latin letters, or showing it in red. Alternatively, such mixtures might be banned completely, or forbidden in some contexts like domain names. For a discussion of the problems and solutions, see the Unicode Technical Report #36, "Security Considerations for the Implementation of Unicode and Related Technology." In any case, such methods require easy access to machine-readable information about the script of each character.
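A check for mixed scripts can be sketched in Python as follows. Python's standard library does not expose the Unicode Script property, so the sketch approximates the script of a letter by the first word of its Unicode character name. This is a rough heuristic assumed for the illustration only; a real implementation should use the actual Script property data. All function names are hypothetical.

```python
import unicodedata


def rough_script(ch):
    """Approximate the script of a character from its Unicode name.

    Heuristic only: non-letters are treated as "Common", and letters get
    the first word of their name, e.g. "Latin" or "Cyrillic".
    """
    if not ch.isalpha():
        return "Common"
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0].capitalize()


def scripts_used(label):
    """Return the set of scripts of the letters in a string."""
    return {rough_script(ch) for ch in label} - {"Common"}


def is_mixed_script(label):
    return len(scripts_used(label)) > 1


def highlight(label, expected="Latin"):
    """Make letters from unexpected scripts visible as code point numbers."""
    return "".join(
        ch if rough_script(ch) in (expected, "Common") else f"[U+{ord(ch):04X}]"
        for ch in label
    )


genuine = "paypal"
spoofed = "p\u0430yp\u0430l"  # contains Cyrillic small letter a twice

print(is_mixed_script(genuine), is_mixed_script(spoofed))
print(highlight(spoofed))
```

Replacing the suspicious letters with their code point numbers is just one way of making them visible in plain text; a user interface could use bolding or color, as suggested above.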

7.1.5.4. Codes and names for scripts

A script can be identified in several ways, described in some detail below:

  • A four-letter code, such as "Grek" (for use in many contexts, e.g., in language codes)

  • A longer and more natural code name, such as "Greek"

  • A three-digit numeric code, such as "200" (not used much)

  • A name in some natural language; the name in English often coincides with the longer code name, but for other languages, it could be completely different (e.g., "Griechisch")

There are two systems of codes for scripts, and they differ in some details: the international standard ISO 15924, "Code for the Representation of Names of Scripts," and the Unicode Standard Annex (UAX) #24, "Script Names," which is available from http://www.unicode.org/reports/tr24/. The Unicode Consortium is the Registration Authority for ISO 15924; see http://www.unicode.org/iso15924/.

UAX #24 defines both four-letter codes (such as "Latn" and "Cyrl") and longer, more legible, name-like codes (like "Latin" and "Cyrillic") for scripts. The four-letter codes match those used in ISO 15924, and they are used as components of language codes. Both types of codes are listed in the Unicode database in http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt.
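In that file, the script codes appear on lines that start with the property alias `sc`, with fields separated by semicolons. The following Python sketch extracts the mapping from short codes to long codes. To keep it self-contained, it parses a small illustrative excerpt embedded in the code instead of downloading the file; the excerpt and the function name `script_aliases` are assumptions made for this example.

```python
# Illustrative excerpt in the format of PropertyValueAliases.txt; the real
# file contains many more lines and properties.
SAMPLE = """\
# Script (sc)
sc ; Cyrl                             ; Cyrillic
sc ; Grek                             ; Greek
sc ; Latn                             ; Latin
gc ; Lu                               ; Uppercase_Letter
"""


def script_aliases(lines):
    """Map four-letter script codes to the longer codes."""
    aliases = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) >= 3 and fields[0] == "sc":
            aliases[fields[1]] = fields[2]
    return aliases


codes = script_aliases(SAMPLE.splitlines())
print(codes["Grek"])  # Greek
```

Lines for other properties, such as the `gc` (General Category) line in the excerpt, are simply skipped.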

The ISO 15924 standard defines codes for some scripts that can be regarded as variants of a basic script, such as "Latf" and "Latg" for old Fraktur and Gaelic variants of the Latin script. The reason for this is existing bibliographic classification, where different versions of a book printed in normal Latin (Roman), Fraktur, or Gaelic letters are recorded separately. In the UAX #24 approach, such variation is not considered as a script difference but as something to be handled at the font and glyph level. Therefore, UAX #24 defines just "Latn" as the generic identifier for the Latin script.

Somewhat similarly, UAX #24 has just the generic "Hani" script for CJK (Han) characters, whereas ISO 15924 lets you differentiate between "Hant" (traditional Chinese) and "Hans" (simplified Chinese).

On the other hand, UAX #24 basically defines the codes for scripts as used when identifying the script of a character as a member of the Unicode set of characters. In other contexts, more specific codes (referring to typographic variants) may be used.

The registry of ISO 15924 contains a table of script codes together with their "names" in English and French, at http://www.unicode.org/iso15924/iso15924-codes.html. Some of the "names" are actually short descriptions, and they may differ from the longer codes. For example, there is a script with the short code "Ital," the long code "Old_Italic," and the English name "Old Italic (Etruscan, Oscan, etc.)" and the French name "ancien italique (étrusque, osque, etc.)." The standard also defines three-digit numeric codes, which are not used much, but they might be used internally, if you need integer-valued identifications for scripts.

When information about the script of a character, fragment, or document is presented to a user, it should preferably be presented in the user's own language. The Common Locale Data Repository (CLDR), described in Chapter 11, contains names of scripts in different languages. A large comparison chart of such localized names is available at http://www.unicode.org/cldr/data/diff/by_type/localeDisplayNames_scripts.html.

There are two special script codes:


Common (Zyyy)

This value is assigned to characters that are used in several scripts, such as punctuation characters and special symbols. Most letterlike symbols, such as the copyright sign ©, are classified as Common, not by the script of the letter from which they have been derived. Such symbols are typically used across scripts. Unassigned code points, too, have this value.


Inherited (Qaai)

This indicates that the character is to be assumed to be in the same script as the (logically) preceding character. This value is assigned to nonspacing marks. For example, the script of the combining acute accent (U+0301) is Inherited, so that when it follows a Latin letter, it is treated as belonging to the Latin script, and when it follows a Greek letter, it is treated as belonging to the Greek script.
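The handling of the Inherited value can be sketched as follows. This is a simplified illustration under stated assumptions: SAMPLE_SCRIPTS is a tiny hand-written excerpt rather than real Unicode data, characters missing from it are treated as Common, and a mark with nothing before it also falls back to Common.

```python
# Hypothetical excerpt of Script values; a real program would load
# the complete data from the Unicode database instead.
SAMPLE_SCRIPTS = {
    "e": "Latin",
    "\u03b5": "Greek",      # GREEK SMALL LETTER EPSILON
    "\u0301": "Inherited",  # COMBINING ACUTE ACCENT
}


def resolve_scripts(text):
    """Return (character, script) pairs, replacing Inherited with the
    script of the logically preceding character."""
    resolved = []
    previous = "Common"  # assumption: fallback when nothing precedes a mark
    for ch in text:
        script = SAMPLE_SCRIPTS.get(ch, "Common")  # assumption: unknown -> Common
        if script == "Inherited":
            script = previous
        resolved.append((ch, script))
        previous = script
    return resolved


print(resolve_scripts("e\u0301"))       # accent after a Latin letter
print(resolve_scripts("\u03b5\u0301"))  # accent after a Greek letter
```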

In technical and scientific contexts, Greek letters may appear in the midst of text otherwise written in the Latin script, e.g., in names like "β-carotene" and "γ rays." Although the Greek letters usually appear in specialized meanings as symbols, Unicode treats them as Greek letters, belonging to the Greek script. However, there are exceptions for symbols encoded as separate characters. For example, the micro sign µ (U+00B5), although it is compatibility equivalent to Greek small letter mu, is defined as belonging to the Common script. Thus, replacing a character with its compatibility equivalent may change the script.
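The compatibility mapping of the micro sign can be observed with Python's standard unicodedata module. The module does not report script values itself, so the sketch below only shows that compatibility normalization (NFKC is used here) turns the micro sign into a different character, namely the Greek letter; the helper name compatibility_form is made up for the example.

```python
import unicodedata


def compatibility_form(ch):
    """Return the NFKC (compatibility) normalized form of a string."""
    return unicodedata.normalize("NFKC", ch)


micro = "\u00b5"  # MICRO SIGN
replaced = compatibility_form(micro)

for ch in (micro, replaced):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Software that normalizes text before classifying it by script may therefore count such a symbol as Greek rather than Common.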

The short (four-letter) and long codes for scripts are summarized in Table 7-1. The table also acts as an overview of writing systems, although it does not include all the historic scripts that have been used. The short code in the first column is the ISO 15924 code, and the second column contains the longer code as defined in UAX #24, using an underline character instead of a space.

Table 7-1. Short and long codes for scripts

| Code | Property value alias | Explanations |
|------|----------------------|--------------|
| Arab | Arabic | Used for Arabic, Persian, and other languages |
| Armn | Armenian | Used for the Armenian language |
| Bali | | Used for Balinese in Indonesia |
| Batk | | Used for Batak languages in Indonesia |
| Beng | Bengali | Used for Bengali, Assamese, etc. |
| Blis | | Bliss symbols; easy-to-learn pictorial symbols |
| Bopo | Bopomofo | An alphabetic writing system for Chinese |
| Brah | | Brahmi, an ancient script used in India |
| Brai | Braille | Braille; symbols touchable by fingertips |
| Bugi | | Buginese, used in Sulawesi, Indonesia |
| Buhd | Buhid | Used for Buhid in the Philippines (island of Mindoro) |
| Cans | Canadian_Aboriginal | Unified Canadian Aboriginal Syllabics |
| Cham | | Cham, used in Cambodia and Vietnam |
| Cher | Cherokee | A syllabic script for the Cherokee language |
| Cirt | | Cirth, a Runic-like script invented by J.R.R. Tolkien |
| Copt | | Coptic; was used for ancient Egyptian, now liturgical |
| Cprt | Cypriot | An ancient script used in Cyprus |
| Cyrl | Cyrillic | Cyrillic; used for many Slavic and non-Slavic languages |
| Cyrs | | Cyrillic, Old Church Slavonic variant |
| Deva | Devanagari | Used for several languages in India, including Hindi |
| Dsrt | Deseret | Invented in the 1850s (for English), still used by Mormons |
| Egyd | | Egyptian demotic |
| Egyh | | Egyptian hieratic |
| Egyp | | Egyptian hieroglyphs |
| Ethi | Ethiopic | Used for several languages in Ethiopia |
| Geok | | Khutsuri, a script previously used for Georgian |
| Geor | Georgian | Used for Georgian (Mkhedruli), spoken in the Caucasus |
| Glag | | Glagolitic (Glagolitsa), an old script for Slavic languages |
| Goth | Gothic | Was used for a now-extinct Germanic language |
| Grek | Greek | Greek (both ancient and modern) |
| Gujr | Gujarati | Used for the Gujarati language in western India |
| Guru | Gurmukhi | Used for the Panjabi language in northern India |
| Hang | Hangul | The currently most common script for Korean |
| Hani | Han | Chinese-Japanese-Korean, known as Hanzi, Kanji, Hanja |
| Hano | Hanunoo | Used for Hanunóo in the Philippines (island of Mindoro) |
| Hans | | Chinese, Simplified writing system |
| Hant | | Chinese, Traditional writing system |
| Hebr | Hebrew | Used for Hebrew, Yiddish, Ladino, etc. |
| Hira | Hiragana | A cursive syllabic script for writing Japanese |
| Hmng | | Pahawh Hmong, used for Hmong in East Asia |
| Hrkt | Katakana_Or_Hiragana | Alias for Hiragana + Katakana |
| Hung | | Old Hungarian, a Runic system used before AD 1000 |
| Inds | | Indus (Harappan); ancient script |
| Ital | Old_Italic | Ancient Italic (Etruscan, Oscan, etc.) |
| Java | | Javanese, used for the Javanese language in Indonesia |
| Kali | | Kayah Li, used in Burma (Myanmar) |
| Kana | Katakana | A non-cursive syllabic script for writing Japanese |
| Khar | | Kharoshthi, an ancient script that was used in Asia |
| Khmr | Khmer | Used for the Cambodian language |
| Knda | Kannada | Used for Kannada (Kanarese) in southern India |
| Laoo | Lao | Used for Lao, the main language of Laos |
| Latf | | Latin, Fraktur (Gothic) variant |
| Latg | | Latin, Gaelic variant |
| Latn | Latin | Used for a wide range of European and other languages |
| Lepc | | Lepcha (Róng), used to write a Tibeto-Burman language |
| Limb | Limbu | Used for Limbu, a Tibeto-Burman language |
| Lina | | Linear A, an ancient script used on Crete |
| Linb | Linear_B | Linear B, an ancient script used to write a form of Greek |
| Mand | | Mandaean, used for Mandaic, a Semitic language |
| Maya | | Mayan hieroglyphs |
| Mero | | Meroïtic, used for a now-extinct language in Egypt |
| Mlym | Malayalam | Used for Malayalam in southern India |
| Mong | Mongolian | Used for Mongolian; a cursive script, complex shaping |
| Mymr | Myanmar | Used for Burmese in Burma (Myanmar) |
| Nkoo | | N'Ko, used for Mandekan languages in western Africa |
| Ogam | Ogham | Was used in the fifth and sixth centuries for early Irish |
| Orkh | | Orkhon, used to write Uyghur, a Turkic language in China |
| Orya | Oriya | Used for the Oriya language in eastern India |
| Osma | Osmanya | Used for the Somali language in Africa |
| Perm | | Old Permic (Abur), previously used for the Komi language |
| Phag | | 'Phags-pa, was used for Mongolian and other languages |
| Phnx | | Phoenician, an ancient consonantal alphabet |
| Plrd | | Pollard Phonetic, used to write the Miao language in China |
| Qaaa | | Reserved for private use (start) |
| Qabx | | Reserved for private use (end) |
| Roro | | Rongorongo, was used on Easter Island |
| Runr | Runic | A historic European script |
| Sara | | Sarati, a "Middle Earth" script invented by J.R.R. Tolkien |
| Shaw | Shavian | Shavian (Shaw), invented for phonetic writing of English |
| Sinh | Sinhala | Used for Sinhala (Sinhalese) in Sri Lanka |
| Sylo | | Syloti Nagri, used for Sylheti in Bangladesh and India |
| Syrc | Syriac | Used for the Syriac language, but also for Arabic |
| Syre | | Syriac (Estrangelo variant) |
| Syrj | | Syriac (Western variant) |
| Syrn | | Syriac (Eastern variant) |
| Tagb | Tagbanwa | Used for Tagbanwa in the Philippines (island of Palawan) |
| Tale | Tai_Le | Tai Le (Dehong Dai), used in southwest China |
| Talu | | New Tai Lue, used to write Lue in East Asia |
| Taml | Tamil | Used for the Tamil language in India, Sri Lanka, etc. |
| Telu | Telugu | Used for the Telugu language in southern India |
| Teng | | Tengwar, a script invented by J.R.R. Tolkien |
| Tfng | | Tifinagh, used to write Berber languages like Tamasheq |
| Tglg | Tagalog | Was used to write Tagalog and other Filipino languages |
| Thaa | Thaana | Thaana, for the Dhivehi languages (in the Maldives) |
| Thai | Thai | Used for Thai, the main language of Thailand (Siam) |
| Tibt | Tibetan | Used for Tibetan, spoken in Tibet and Bhutan |
| Ugar | Ugaritic | An ancient cuneiform script used to write Ugaritic |
| Vaii | | Vai, a syllabary used to write the Vai language in Liberia |
| Visp | | Visible Speech, a phonetic and "organic" script |
| Xpeo | | Old Persian Cuneiform |
| Xsux | | Cuneiform, Sumero-Akkadian |
| Yiii | Yi | A large syllabary used to write Yi (Lolo) in China |
| Zxxx | | Code for unwritten languages |
| Zyyy | Common | Code for undetermined script |
| Zzzz | | Code for uncoded script |


7.1.5.5. The Script property: the script of a character

The data file that specifies values of the Script (sc) property, i.e. the script of each Unicode character, is http://www.unicode.org/Public/UNIDATA/Scripts.txt. It uses the longer names for the scripts. Its entries look like the following:

0993..09A8    ; Bengali # Lo  [22] BENGALI LETTER O..BENGALI LETTER NA

This sample line says that characters U+0993 through U+09A8 belong to the script "Bengali." Such information is sufficient for automatic classification of characters by script. The rest of the line, after the # sign, is a comment, mentioning the general category (Lo), the number of characters in the range (22), and the range expressed by names of characters.
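A line in this format can be taken apart with a few lines of code. The following is a minimal sketch rather than a complete parser for the file: the function name parse_script_line is invented for the example, and it assumes that a line contains either a range written as "first..last" or a single code point, followed by a semicolon and the script name.

```python
import re

# A range (or single code point) in hexadecimal, then ';', then the script.
LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_script_line(line):
    """Return (first, last, script) for a data line, or None for
    comment-only, blank, or unrecognized lines."""
    data = line.split("#", 1)[0].strip()  # the comment part is ignored
    if not data:
        return None
    match = LINE_RE.match(data)
    if match is None:
        return None
    first = int(match.group(1), 16)
    last = int(match.group(2), 16) if match.group(2) else first
    return first, last, match.group(3)


sample = "0993..09A8    ; Bengali # Lo  [22] BENGALI LETTER O..BENGALI LETTER NA"
first, last, script = parse_script_line(sample)
print(f"U+{first:04X}..U+{last:04X} {script}, {last - first + 1} characters")
```

The character count computed from the range endpoints can be compared with the bracketed number in the comment as a simple consistency check.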

For readability, the data in the file has been grouped by script. This lets you see quickly which characters are contained in a given script, but it makes it more difficult to find the script of a given character.
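One way around this is to sort the ranges by their first code point once and then use binary search to look up individual characters. The sketch below illustrates the idea with three sample ranges typed in by hand; a real program would fill the list from the whole data file. The names RANGES, STARTS, and script_of are made up for this example, and returning None for characters outside the listed ranges is a choice made here, not something the data file prescribes.

```python
import bisect

# Sample (first, last, script) ranges, sorted by first code point.
RANGES = sorted([
    (0x0993, 0x09A8, "Bengali"),
    (0x0041, 0x005A, "Latin"),
    (0x0391, 0x03A1, "Greek"),
])
STARTS = [first for first, _, _ in RANGES]


def script_of(ch):
    """Return the script of a character, or None if no listed range
    contains it."""
    cp = ord(ch)
    # Find the last range whose first code point is <= cp.
    index = bisect.bisect_right(STARTS, cp) - 1
    if index >= 0:
        first, last, script = RANGES[index]
        if cp <= last:
            return script
    return None


print(script_of("\u0995"))  # BENGALI LETTER KA
print(script_of("A"))
print(script_of("!"))
```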



Unicode Explained
ISBN: 059610121X
EAN: 2147483647
Year: 2006
Pages: 139
