7.1. Writing Systems and IT

In information technology, we often deal with text just like any other data, with no regard to its internal structure or meaning. When sending a plain text file, for example, we consider at most issues like efficiency, encoding, and checking that the data arrives unchanged. However, operations like page layout, searching, indexing, and word processing need to be sensitive at least to some features and variation of writing systems.

7.1.1. Internationalization (i18n) and Related Issues

Character code problems are part of a topic called internationalization, jocularly abbreviated as i18n, where 18 stands for the 18 letters between "i" and "n" in this difficult word. It is really not a matter of being international; rather, it is a matter of letting people use their national languages and notations. Typically, international communication on the Internet is carried out in English, but "internationalization" is meant to create realistic possibilities for communication in any language.

Internationalization mainly revolves around the problems of using various languages and writing systems (scripts). It includes questions like text directionality, which was discussed in Chapter 5. This book discusses mostly just the character-level aspects of internationalization.

Internationalization is related to localization, sometimes abbreviated as l10n. Localization means that data and systems are adapted to specific linguistic, cultural, and local habits and rules, collectively called a locale. In the modern approach, localization is usually based on internationalization. It is often much better to start from a neutral basis and develop mappings to different locales than to map from a specific locale to another.

The word globalization is used to denote the general idea of making things work globally as well as different practical methods and aspects. Quite often, this means internationalization followed by localization. However, it can also mean things like supporting different repertoires of characters, for any use whatsoever. The terms are often used interchangeably, or vaguely, but perhaps a useful division is the following:

Internationalization turns the internal representation of data into a neutral, easily processable and well-defined format. For example, for processing monetary data, we aim at using an internal format that always identifies the currency but does not fix the way in which such data is displayed.
Localization implements the presentation of data to users in a manner that adapts to their expectations and preferences. A sum of money stored in an internationalized format as the number 42.5 and the currency code USD (U.S. dollar) might be presented as "$42.50" to a U.S. user and as "42:50 $" to a Swedish user.
Globalization is an umbrella term that covers internationalization, localization, and other ways of making data presentation and processing truly global, so that different languages, notations, and conventions can be used.
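
The division between internationalization and localization can be illustrated with a minimal Python sketch. The Money class, the PATTERNS table, and the locale identifiers below are hypothetical and hand-written for this illustration only; a real system would take such rules from locale data rather than hard-code them.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Money:
    """Neutral internal format: the amount and the currency, no presentation."""
    amount: Decimal
    currency: str  # currency code such as "USD"


# Hypothetical presentation data, invented for this sketch.
SYMBOLS = {"USD": "$"}
PATTERNS = {
    # locale: (decimal separator, layout pattern)
    "en-US": (".", "{symbol}{number}"),
    "sv-SE": (":", "{number} {symbol}"),
}


def format_money(money, locale):
    """Localization step: render the neutral value for one locale."""
    separator, pattern = PATTERNS[locale]
    number = f"{money.amount:.2f}".replace(".", separator)
    symbol = SYMBOLS.get(money.currency, money.currency)
    return pattern.format(symbol=symbol, number=number)


price = Money(Decimal("42.5"), "USD")
print(format_money(price, "en-US"))  # $42.50
print(format_money(price, "sv-SE"))  # 42:50 $
```

The stored value never changes; only the last step depends on the user, which is why adding a new locale in this design means adding data, not rewriting the processing.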

Note that most people and most documents probably use the word "internationalization" in a broad sense that roughly corresponds to our definition of "globalization." Sometimes "globalization" is used as a very specific term to refer to software that has been internationalized and that supports localization at runtime, i.e., switching between locales without restarting the program.

7.1.2. Aspects of Writing and Their IT Impact

In information technology, we usually do not need to know about the sound values of letters and other symbols. Obvious exceptions to this include language processing such as automatic speech synthesis or loose comparison of strings by their phonetic similarity (e.g., in search systems). Similarly, the meanings of words formed from characters are irrelevant to most data processing applications. There are, however, somewhat more technical aspects of writing that can be significant.

7.1.2.1. Writing direction

In normal text processing, some basic features of the writing system used in the text are significant. The problem of left-to-right versus right-to-left writing was discussed in the section "Directionality" in Chapter 5. The writing direction affects text rendering in many ways, though many people do not realize this, since they have always used left-to-right writing only.

Vertical writing means writing text in lines that run vertically from top to bottom, or sometimes from bottom to top. Whether such vertical lines (i.e., columns) run right to left or left to right is a different issue. East Asian writing has traditionally been vertical, but horizontal writing is now used, too, partly because many computer systems have been unable to produce vertical layout. Another reason is that it makes it easier to insert text (such as names and formulas) in Latin letters into a document.

Vertical writing as such is handled outside Unicode and above the character level in general, using layout tools that produce it. However, the possibility of writing vertically has some impact. The shape of some Japanese punctuation marks is different in vertical writing; for example, the colon, :, is rotated 90 degrees. This should be handled by the rendering software as a glyph selection issue. However, there are some variants of such characters for vertical text, vertical forms, in the CJK Compatibility Forms block. Moreover, there are halfwidth and fullwidth variants of ASCII characters, for use in vertical writing, which in practice requires characters to be of fixed width. This width is either the width of a display cell (square) or half of it.
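
The width variants and vertical forms mentioned above can be inspected with Python's standard unicodedata module. The following is only a small sketch; the choice of sample characters and the describe function are assumptions made for illustration.

```python
import unicodedata


def describe(ch):
    """Return (name, East Asian width class, compatibility-normalized form)."""
    return (
        unicodedata.name(ch),
        unicodedata.east_asian_width(ch),
        unicodedata.normalize("NFKC", ch),
    )


samples = [
    "A",        # ordinary ASCII letter
    "\uFF21",   # fullwidth variant of "A"
    "\uFF76",   # halfwidth katakana letter
    "\uFE35",   # vertical presentation form of "("
]

for ch in samples:
    name, width, folded = describe(ch)
    print(f"U+{ord(ch):04X} {name}: width={width}, NFKC={folded!r}")
```

In this output, the width class F means fullwidth and H means halfwidth. Compatibility normalization (NFKC) maps each variant back to the ordinary character, which is often what searching and indexing need.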

7.1.2.2. What does a language setting really set?

The language of text is crucial for many data processing tasks, though much of processing is completely independent of language. The effect of languages has been greatly obscured by software and documents that mix quite separate concepts with each other: writing system, language, character repertoire, character encoding, keyboard layout, etc. These are interrelated but fundamentally different things. In particular, it is crucial to distinguish between the following language settings:

The language of a program's user interface, affecting menus, error messages, etc.
Keyboard settings, which have usually been designed for some particular language and named according to it (e.g., "French keyboard")
The language of a document being written, viewed, or otherwise processed, perhaps with variation inside a document (since it may contain texts in several languages)
The user's preferred language for accessing some content, in situations where a document is available in several languages

These are all logically independent of each other, and of character encoding as well as of fonts.

The user interface language is often fixed by the program designer, according to the estimated user community. Many programs are available as different language versions, and, in some cases, you might even be able to buy a multilingual version, where the language can be changed on the fly, or at least between sessions with the program.

In Chapter 2, we discussed how the different needs of different languages could be taken into account in keyboard design, especially when using virtual keyboards. The current keyboard setting is often displayed at the bottom of the screen, using language codes like "EN" for English, etc. However, such settings really relate to the keyboard only. I am writing this with the keyboard set to "FI" (Finnish), even though I am writing in English and have the language set to English in the word processor. The reason is that I want my keyboard keys to work the way that the keycaps suggest. The user interface language of the word processor (e.g., the language of commands like "File," "Edit," etc.) is yet another thing. Finally, if I visit a web page, I might have set my browser to ask primarily for a German version of a page, if available, if my native language were German.

We will next discuss the two other meanings of "language settings" by simple examples.

7.1.3. Setting the Language in Word Processing

Advanced word processors typically support more than one language, and they need to know or to guess the language of the text. The support might include:

Automatic operations on punctuation to match the rules of the language
Hyphenation and language-sensitive line breaking in general
Spellchecking (while typing, or upon specific request)
Grammar checks
Hints on synonyms for a word upon request
Translation tools of varying kind, e.g., showing translations for a word upon request

When you acquire a word processor or other text-related software, it is important to consider not only the user interface language but also the language support you will need. However, you might be able to buy extra modules later, extending the program with support for new languages.

7.1.3.1. Automatic operations on punctuation

As an example, if you type the data "foo" in MS Word, with suitable language packs installed if needed, you will see and your document will actually contain:

“foo” if the document language is set to English
« foo » if the document language is set to French
„foo“ if the document language is set to German
"foo" if the document language is set to Danish

This means that you can use an ordinary keyboard with just one key for a quotation mark, since the program converts it to language-specific characters. There will be some other examples on fixing punctuation by language-specific rules later in this chapter.

This is just fine when it works right. However, several things can go wrong. If the word processor has a wrong idea of the language of the text, it will not perform the conversion at all, or it will perform a wrong conversion, which is even worse. When editors combine texts from different authors and sources, they might fail to check such things. As a result, a publication might contain a mixture of styles (like “foo” and „foo“ and "foo"). Unfortunately, there is often no simple way to fix such things, since the conversions take place when typing; changing the language for already typed text does not change its punctuation.

On the other hand, sometimes a conversion, although correct for the language used in the text in general, is not correct in some specific occasion. Your English text might contain a block quotation in French, and inside it, French punctuation should be used. (Whether quoted text should preserve its original punctuation is a matter of style and rules. The point here is that situations exist where people wish to preserve it.)

Sometimes a conversion of quotation marks is not desirable at all. You may need to use ASCII quotation marks, since you are writing about a computer language. In that case, you can press Ctrl-Z immediately after typing a quotation mark that was converted by MS Word, since Ctrl-Z undoes the automatic replacement. Thus, to produce "foo" with straight quotes, you would type "^Zfoo"^Z where ^Z denotes pressing Ctrl-Z. Alternatively, you could change the MS Word settings to disable any automatic replacement of quotation marks.
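
The automatic replacement described in this section can be sketched in a few lines of Python. This is not how MS Word implements it: the QUOTES table and the smart_quotes function are hypothetical, the sketch simply alternates between opening and closing marks, and the no-break spaces inside the French marks are an assumption about typical practice.

```python
# Hypothetical table: (opening, closing) replacement per language code.
QUOTES = {
    "en": ("\u201C", "\u201D"),              # left and right double quotation marks
    "fr": ("\u00AB\u00A0", "\u00A0\u00BB"),  # guillemets with no-break spaces
    "de": ("\u201E", "\u201C"),              # low-9 opening, high-6 closing
}


def smart_quotes(text, lang):
    """Replace ASCII quotation marks, alternating opening and closing forms."""
    opening, closing = QUOTES[lang]
    result = []
    expect_opening = True
    for ch in text:
        if ch == '"':
            result.append(opening if expect_opening else closing)
            expect_opening = not expect_opening
        else:
            result.append(ch)
    return "".join(result)


for lang in ("en", "fr", "de"):
    print(lang, smart_quotes('"foo"', lang))
```

Since the function works on finished text with an explicit language argument, it also shows what interactive conversion lacks: a way to apply the rules of another language to text that has already been typed.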

7.1.3.2. Spelling and grammar checks

Word processors and other text-oriented software often contain automatic tools for spellchecking, perhaps even for grammar and style checks. A spellchecker typically detects misspelled words and may suggest corrections. A grammar or style check operates on constructs larger than a word, and it is based on some linguistic analysis of sentences. A grammar check could detect, for example, the lack of a predicate verb in a sentence.

Opinions on the usefulness of such checks vary greatly, and so does the quality of checkers. When writing specialized text with many special terms and rare words, a spellchecker typically flags a large number of words as potentially misspelled. It may also suggest alternatives to such words, often letting the user fix an error easily, but sometimes presenting something absurd.

When writing for a wide audience, spellchecking is a very good idea. If a spellchecker does not recognize some special word that you use, odds are that many readers won't either.

When you set the language of text in a word processor, the effect depends on the extent of support for that language in the program. Perhaps the program simply records the information about language without using it in any way. It might still pass the information forward when the text is transferred to another program. Moreover, other versions of the program might use the information in a useful way. Support for a language might consist of some simple operations on punctuation marks, as described earlier. It might also include a spellchecker, grammar checker, style checker, readability checker, synonym dictionary, etc.

If you set the language and see something useful happening (e.g., quotation marks turning to chevrons when the language has been set to French), the program might still fail to do any spellchecks, even if you have enabled checking in general. The software might lack a spelling dictionary and other spelling support for a language. An easy way to check this is to write something nonsensical, like qffqgfq, and see whether the program flags it as an error.

7.1.3.3. Determining the language of text

A word processor could deduce the language of a document or a fragment of a document in different ways. In particular, MS Word uses the following techniques:

Heuristic recognition: MS Word analyzes the text and deduces the language by statistical analysis. This feature can be disabled, though. When it is enabled, you can start typing text, and after a few words, MS Word probably guesses the appropriate language and switches to it. You may observe that words indicated first as misspelled or suspicious with a red wavy underline turn into normal words.
Explicit information from user: As a user, you can click on the language indicator text at the bottom of the MS Word window (e.g., the word "English" there). This opens a small window as in Figure 7-1, and there, you can select a language. This will apply to text you will type, until the language setting is changed. If you have first selected some text (e.g., by double-clicking or painting), only that fragment of text will be affected. Thus, if you have typed some text in English, and then noted that MS Word flags a name like Rhône as potentially misspelled, you can select the word by double-clicking on it and set the language to French for that word only. (You can also right-click after the selection, to get a pop-up menu with language settings as one of the available functions.)
Embedded information: If you open an existing MS Word document, it contains language information corresponding to what was deduced or expressed when writing it. MS Word will read and use that information. Similar things may happen with some other document formats as well, e.g., when opening an HTML document in MS Word.
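
To give a rough idea of what heuristic recognition can mean, the following Python sketch guesses a language by counting very common words. It is a toy under stated assumptions, not the statistical method MS Word uses: the STOPWORDS lists are tiny and hand-picked, and the guess_language function is hypothetical.

```python
import re

# Hand-picked common words per language; far too small for real use.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "fr": {"le", "la", "les", "et", "de", "des", "est", "un", "une", "que"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "zu"},
}


def guess_language(text):
    """Return the language whose common words occur most often, or None."""
    words = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(word in common for word in words)
        for lang, common in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("The cat is in the garden"))
print(guess_language("La maison est près de la rivière"))
print(guess_language("Der Hund ist nicht in das Haus gegangen"))
print(guess_language("qffqgfq"))
```

A guess of this kind needs several words of input before it becomes reliable, which matches the observation that the language is recognized only after a few words have been typed.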

7.1.3.4. Exercise

This exercise requires MS Word or some other word processor with some support for different languages. You also need to know some basic functions in it, or to consult a manual to learn about them. With these premises, this exercise may illustrate the benefits of indicating the language:

Open some small document in a word processor.
Select all text in the document (e.g., with Ctrl-A in MS Word) and perform a spellcheck on it.
Set the word processor to check spelling when typing.
Then add some long word in another language supported by the program. Insert the word in several places. You should now see the word indicated as misspelled.
Set the program to use justification on both sides and word division as needed. You should now see the long word incorrectly divided, or left undivided. (If this does not happen, add it to suitable places.)
Click on one of the occurrences of the long foreign word and set its language to the correct one. You should now see the misspelling indication vanish and the word split correctly, provided of course that its language is sufficiently well supported by the word processor.

Figure 7-1. Setting the language of text in MS Word (the style and content of this window depends on the version of MS Word and previous use of languages in a document)

This paragraph illustrates the topic of the exercise. It contains the longish word Haupteigenschaft. If a word processor does not treat it as a German word, it probably leaves the word undivided, often causing poor formatting (too much or too little spacing between words), or divides it improperly. The proper division points are as in Haupt-ei-gen-schaft. When the word processor knows the language, the writer need not know the hyphenation rules of that language, except perhaps to fix the hyphenation of some special words.

7.1.4. Setting Language Preferences in Browsers

We will briefly discuss the language settings in web browsers. Although they are usually not very important (they relate to "language negotiation" described in Chapter 10), they have caused some confusion that needs to be cleared up. In particular, they have been confused with other, more important language settings.

A dialog for setting language preferences in Mozilla Firefox can be invoked with the command Tools → Options → General → Languages, and the dialog window is shown in Figure 7-2. In IE 6, you would enter a similar dialog by selecting Tools → Internet Options → General → Languages → Language Preferences. As we mentioned in Chapter 1, these preferences are typically coupled with the setting of the default encoding (to be implied for pages that do not specify their encoding), which is something quite different.

Figure 7-2. Setting language preferences in Firefox

The settings may include one or more languages, in order of preference. In the dialog, the user can typically add (or remove) languages and move them up and down in the order. Ideally, the user should list all languages she understands to some extent at least. Such settings are sent by the browser when it sends a request to a web server. The server may then use the information to select a particular language version of the requested page. Examples of this include http://www.debian.org/ and http://www.altavista.com. However, this is rare, and most bilingual or multilingual sites do not use such technology but typically just explicit language versions.

The language preferences in browsers have no effect except when a web page is available in several languages, using a particular protocol.
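
In HTTP, the browser sends these preferences in the Accept-Language request header, where each language may carry a quality value q between 0 and 1. The following Python sketch shows how a server might pick a language version from such a header. The function names and the sample header are made up for illustration, and the sketch ignores details such as the wildcard "*".

```python
def parse_accept_language(header):
    """Parse e.g. 'de, en;q=0.8' into [(tag, q), ...], best preference first."""
    preferences = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        quality = 1.0  # default when no q parameter is given
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed value as "not acceptable"
        preferences.append((tag.strip().lower(), quality))
    # The sort is stable, so header order breaks ties between equal q values.
    return sorted(preferences, key=lambda pref: pref[1], reverse=True)


def choose_version(header, available, default):
    """Pick the best available language version, falling back to a default."""
    for tag, quality in parse_accept_language(header):
        if quality <= 0:
            continue
        if tag in available:
            return tag
        primary = tag.split("-")[0]  # e.g. "de-at" falls back to "de"
        if primary in available:
            return primary
    return default


print(choose_version("de, en;q=0.8, fi;q=0.5", {"en", "fi"}, "en"))
```

A user who lists only one language gets the default version whenever that language is not available, which is one reason for listing every language that the user understands to some extent.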

For completeness, we need to mention, though, that Netscape and Mozilla software may include information about the user's language preferences (into message headers), when such software is used to post an article to Usenet. This is in principle a threat to privacy.

7.1.5. Script = Writing System

The word "script" is often used instead of "writing system," and we follow suit in this book, even though some confusion is possible. To many people, "script" means a (small) program or a command file, which is very different from a writing system for human languages. Here "script" means basically a collection of letters and other characters, meant for writing human languages in a systematic way.

A script, as a writing system, is not an exact concept but a matter of judgment and convention. We say that languages such as English, German, Icelandic, and Vietnamese use the Latin script, although they have different repertoires of characters. German has, in addition to the basic Latin letters "a" to "z," letters like ä. Icelandic has accented letters like á and the extra letters ð and þ, which are regarded as Latin letters by convention. Vietnamese uses multiple diacritics, although they are often dropped due to technical limitations or ignorance.

Thus, "Latin script" is a broad concept. It contains many more characters than most people imagine. What is common is the historical basis, the letters used in writing classical Latin. Different diacritic marks and even completely new characters have been added, to deal with sounds that cannot be conveniently expressed using the basic Latin letters. The reason why the Icelandic ð and þ are counted as Latin letters is not in their shape but their use in a language that uses letters "a" to "z" as the basis of the alphabet. The Latin script also contains, by convention, a large set of phonetic (IPA) characters, although some of them have been rather directly derived from Greek letters, such as Latin small letter gamma ɣ (U+0263).

Other scripts include Greek, Cyrillic, Arabic, Hebrew, Hangul (Korean), and Han (Chinese) script. Although many scripts have common ancestors (in fact, the scripts used by mankind can be traced back to just a few different original scripts), they may have diverged considerably. The Greek and Cyrillic scripts, for example, resemble the Latin script quite a lot, but there are so many changes in the alphabet as a whole that they are classified as separate scripts. For information on the nature and use of different scripts, consult the web site http://www.omniglot.com/writing/.

Many languages use and have always used a particular script. For some languages, the script has been changed to another in the course of time. Turkish was once written in the Arabic script and is now written in the Latin script. Some languages have changed script several times, often for political reasons. Since changes often take time, a language might have two scripts in use at the same time, and such a situation might even become relatively permanent.

7.1.5.1. Categories of Scripts

In the section "Variation of Writing Systems" in Chapter 1, we described some basic categories of scripts: alphabetic, consonant, syllabic, and ideographic. The differences between these categories are more difficult to handle in automatic processing than the variation of character repertoires. For example, Greek text is displayed basically the same way as English: you put one character after another, left to right, with lines running top down, and breaking lines between words, unless you have some hyphenation routine. Displaying Arabic, on the other hand, requires writing right to left and selecting the shape of a character according to its position in a word. Much data-processing software and many systems have been designed with the implicit assumption that everything is written pretty much the same way as English, although perhaps with some other letters.

7.1.5.2. Need for script information

In some contexts, it is useful to be able to specify the script used in a document or part of a document in a manner suitable for automatic processing. Moreover, most characters can be classified as belonging to one script only. For example, suppose that a document has been specified to be in the Latin script, or has been inferred to be in the Latin script by an analysis of its content. If the document contains an isolated Cyrillic letter, this could be an error (e.g., a user has entered a Cyrillic "A" by mistake), and in any case, it is something special that may need human attention.

Script information can also be used in pattern matching. For example, you might wish to use a pattern that corresponds to any sequence of characters in the Cyrillic script. In practice, patterns should normally also include the script name "Common," which refers to characters that appear in several scripts. Script information can be specified at different levels:

Document: The script of a document can be expressed informally, in prose (e.g., "this document contains old Turkish, written in the Arabic script"), or it can be guessed from the context, language, or even encoding. In the future, the script can also be specified formally as part of the language code specified for the document.
Fragment of a document: This could be a section, a paragraph, a sentence, or even an individual word, or other part of a document. For example, a scholarly work could be written in English but with Greek quotations in Greek letters. You might be able to use markup or out-of-band information to indicate the script of a fragment, e.g., as part of the language code.
Character: This level is covered well in the Unicode standard. As described below, the standard assigns each character a script.

Although many blocks in Unicode contain characters from one script, and might have been named according to a script, there is no one-to-one correspondence between blocks and scripts. Some blocks contain characters from different scripts, and some scripts have been divided into several blocks (e.g., Basic Latin, Latin-1 Supplement, Latin Extended-A, etc.). Therefore, the Unicode standard defines a separate property that specifies the script of a character, Script (sc).
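
The values of the Script property are published in the Unicode database file Scripts.txt as ranges of code points. Python's standard unicodedata module does not expose this property, so the following sketch parses a few lines written in the format of that file. The excerpt is abridged and illustrative only, and the function names are assumptions; a real program would read the complete file.

```python
import bisect

# A few lines in the format of Scripts.txt (abridged, for illustration only).
EXCERPT = """\
0041..005A    ; Latin      # basic capital letters
0061..007A    ; Latin      # basic small letters
00B5          ; Common     # micro sign
0250..0293    ; Latin      # IPA extensions, including U+0263
0300..0304    ; Inherited  # some combining diacritic marks
03B1..03C1    ; Greek      # small letters alpha to rho
0410..044F    ; Cyrillic   # basic Russian alphabet
"""


def parse_scripts(text):
    """Turn 'XXXX..YYYY ; Script' lines into sorted (first, last, script) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        code_range, script = (part.strip() for part in line.split(";"))
        first, _, last = code_range.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), script))
    return sorted(ranges)


RANGES = parse_scripts(EXCERPT)
STARTS = [first for first, _, _ in RANGES]


def script_of(ch):
    """Look up the script of one character by binary search over the ranges."""
    code_point = ord(ch)
    index = bisect.bisect_right(STARTS, code_point) - 1
    if index >= 0 and code_point <= RANGES[index][1]:
        return RANGES[index][2]
    return "Unknown"  # not covered by this excerpt


for ch in ["a", "\u0430", "\u0263", "\u00B5", "\u0301"]:
    print(f"U+{ord(ch):04X} -> {script_of(ch)}")
```

The lookup works on ranges of code points and never consults block names, which reflects the point that blocks and scripts do not correspond one to one.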

7.1.5.3. Scripts and spoofing

Script information has become more important due to the mixing of characters from different scripts in order to misguide people by "spoofing." The idea in the kind of spoofing discussed here is to present text to the user in a format that looks correct but internally means something different. Spoofing is possible even within one script. The familiar example is the use of "l" (lowercase letter "l") instead of "1" (digit one), or vice versa, making use of the fact that in many fonts, they are hard to distinguish. Another old example is the confusion between "O" (capital letter "o") and "0" (digit zero), although they are rather different in most modern fonts, when you see both of them.

Spoofing is a relatively modern phenomenon, since it revolves around the difference between visible shapes of characters and their internal digital representation. In the old times, it did not matter much if you typed "O" for "0" in a number, since the character you entered existed only on paper and was judged only on its appearance. In fact, some old typewriters forced people to type that way, since they lacked digits "0" and "1" altogether. In the modern world, it matters a lot whether an address, a password, or a variable name contains the letter "l" or the digit "1," since they have completely separate internal representations.

Spoofing might be accidental: people make mistakes in typing and confuse characters with each other. Spoofing might also be used with good aims: some instructions on choosing good passwords suggest that you spoof (e.g., use "l" in place of "1") to make it more difficult to steal your password from a casual glimpse of it or to crack it with dictionary attacks.

For the most part, spoofing is used in attempts to break into systems or otherwise compromise their security. Perhaps the best known form of spoofing is to use Internet domain names that misleadingly resemble another. If there is a widely known web server at www.paypal.example, an attacker might set up www.paypa1.example and send, say, a million copies of an email message asking people to log in at the following site: http://www.paypa1.example. They are then asked to change their password, to protect their account against some threat. The attacker would have set up a server that looks and acts like the real service being imitated but actually steals the user ID and password given on login. Such operations have often succeeded even when they rely on something as simple as the similarity of "l" and "1" in many fonts.

The particular form of spoofing that is used to mislead people into logging in somewhere and giving their confidential information is called "phishing." Users could resist such attacks by refusing to click on addresses shown in email messages, but many people are careless and lazy. It's so much easier to click (or cut and paste) than to type.

Unicode, with its large repertoire of characters, has opened new possibilities for spoofing. This is relevant in cases where national characters are used in Internet domain names. (Their use in web addresses otherwise might be relevant, too, but usually it's the domain name part, the server name, that is crucial in spoofing.) If you were able to distinguish "paypa1" from "paypal," perhaps because you were using a font that makes the difference obvious, how about "pypl? This string actually contains two occurrences of the Cyrillic small letter "a" (U+0430). It is highly unlikely that you would be able to distinguish them from the Latin small letter "a" by their appearance only, since in practically all fonts, they look exactly the same.

Proposed solutions include the display of URLs or strings in general in a manner that highlights any abnormal changes of scripte.g., by bolding any Cyrillic letter that appears between Latin letters (pyp l), or showing it in red. Alternatively, such mixtures might be banned completely, forbidden in some contexts like domain names. For a discussion of the problems and solutions, see the Unicode Technical Report #36, "Security Considerations for the Implementation of Unicode and Related Technology." In any case, such methods require easy access to machine-readable information about the script of each character.
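
A very rough version of such highlighting can be sketched with Python's standard library alone. Since the unicodedata module does not provide the Script property, the sketch approximates the script of a letter by the first word of its Unicode character name. This is a heuristic assumed here for illustration only, and the function names are hypothetical; real software should use the actual script data.

```python
import unicodedata


def rough_script(ch):
    """Approximate a letter's script by the first word of its character name."""
    if not ch.isalpha():
        return None  # digits, punctuation, etc. are not classified here
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def scripts_used(label):
    """Return the set of approximate scripts of the letters in a string."""
    return {script for script in map(rough_script, label) if script}


def mark_foreign_letters(label, expected="LATIN"):
    """Wrap every letter outside the expected script in square brackets."""
    return "".join(
        ch if rough_script(ch) in (None, expected) else f"[{ch}]"
        for ch in label
    )


suspicious = "p\u0430yp\u0430l"  # contains two Cyrillic small letters "a"
print(scripts_used("paypal"))
print(scripts_used(suspicious))
print(mark_foreign_letters(suspicious))
```

The string "paypa1" passes this check unnoticed, since the digit one is not a letter from another script. Script information helps only against mixing of scripts, not against spoofing within one script.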

7.1.5.4. Codes and names for scripts

A script can be identified in several ways, described in some detail below:

A four-letter code, such as "Grek" (for use in many contexts, e.g., in language codes)
A longer and more natural code name, such as "Greek"
A three-digit numeric code, such as "200" (not used much)
A name in some natural language; the name in English often coincides with the longer code name, but for other languages, it could be completely different (e.g., "Griechisch" in German)

There are two systems of codes for scripts, and they differ in some details: the international standard ISO 15924, "Code for the Representation of Names of Scripts," and the Unicode Standard Annex (UAX) #24, "Script Names," which is available from http://www.unicode.org/reports/tr24/. The Unicode Consortium is the Registration Authority for ISO 15924; see http://www.unicode.org/iso15924/.

UAX #24 defines both four-letter codes (such as "Latn" and "Cyrl") and longer, more legible, name-like codes (like "Latin" and "Cyrillic") for scripts. The four-letter codes match those used in ISO 15924, and they are used as components of language codes. Both types of codes are listed in the Unicode database in http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt.

The ISO 15924 standard defines codes for some scripts that can be regarded as variants of a basic script, such as "Latf" and "Latg" for old Fraktur and Gaelic variants of the Latin script. The reason for this is existing bibliographic classification, where different versions of a book printed in normal Latin (Roman), Fraktur, or Gaelic letters are recorded separately. In the UAX #24 approach, such variation is not considered as a script difference but as something to be handled at the font and glyph level. Therefore, UAX #24 defines just "Latn" as the generic identifier for the Latin script.

Somewhat similarly, UAX #24 has just the generic "Hani" script for CJK (Han) characters, whereas ISO 15924 lets you differentiate between "Hant" (traditional Chinese) and "Hans" (simplified Chinese).

On the other hand, UAX #24 basically defines the codes for scripts as used when identifying the script of a character as a member of the Unicode set of characters. In other contexts, more specific codes (referring to typographic variants) may be used.

The registry of ISO 15924 contains a table of script codes together with their "names" in English and French, at http://www.unicode.org/iso15924/iso15924-codes.html. Some of the "names" are actually short descriptions, and they may differ from the longer codes. For example, there is a script with the short code "Ital," the long code "Old_Italic," and the English name "Old Italic (Etruscan, Oscan, etc.)" and the French name "ancien italique (étrusque, osque, etc.)." The standard also defines three-digit numeric codes, which are not used much, but they might be used internally, if you need integer-valued identifications for scripts.

When information about the script of a character, fragment, or document is presented to a user, it should preferably be presented in the user's own language. The Common Locale Data Repository (CLDR), described in Chapter 11, contains names of scripts in different languages. A large comparison chart of such localized names is available at http://www.unicode.org/cldr/data/diff/by_type/localeDisplayNames_scripts.html.

There are two special script codes:

Common (Zyyy): This value is assigned to characters that are used in several scripts, such as punctuation characters and special symbols. Most letterlike symbols, such as the copyright sign ©, are classified as Common, not by the script of the letter from which they have been derived. Such symbols are typically used across scripts. Unassigned code points, too, have this value.
Inherited (Qaai): This indicates that the character is to be assumed to be in the same script as the (logically) preceding character. This value is assigned to nonspacing marks. For example, the script of the combining acute accent (U+0301) is Inherited, so that when it follows a Latin letter, it is treated as belonging to the Latin script, and when it follows a Greek letter, it is treated as belonging to the Greek script.
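
The rule for Inherited can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table SAMPLE_SCRIPTS is a hypothetical, hand-made stand-in covering a handful of characters (real data would come from the Unicode script data file), and the function name resolve_scripts and the fallback label "Unknown" are invented for this sketch.

```python
# Hypothetical, hand-made script table for a few characters only.
# Real values would be read from the Unicode script data file.
SAMPLE_SCRIPTS = {
    "e": "Latin",
    "\u03b5": "Greek",      # GREEK SMALL LETTER EPSILON
    "\u0301": "Inherited",  # COMBINING ACUTE ACCENT
    " ": "Common",
}


def resolve_scripts(text, table=SAMPLE_SCRIPTS):
    """Return one script name per character of text.

    A character whose script is Inherited takes the (already resolved)
    script of the logically preceding character. An Inherited character
    at the very start of the text has nothing to inherit from and is
    left as "Inherited".
    """
    resolved = []
    for ch in text:
        # "Unknown" is a placeholder of this sketch for characters
        # missing from the sample table.
        script = table.get(ch, "Unknown")
        if script == "Inherited" and resolved:
            script = resolved[-1]
        resolved.append(script)
    return resolved


# "e" + acute accent, a space, then epsilon + acute accent
result = resolve_scripts("e\u0301 \u03b5\u0301")
print(result)
```

With this sample table, the accent after "e" resolves to Latin and the accent after epsilon resolves to Greek, while the space stays Common.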

In technical and scientific contexts, Greek letters may appear in the midst of text otherwise written in the Latin script, e.g., in names like "β-carotene" and "γ rays." Although the Greek letters usually appear in specialized meanings as symbols, Unicode treats them as Greek letters, belonging to the Greek script. However, there are exceptions for symbols encoded as separate characters. For example, the micro sign µ (U+00B5), although compatibility equivalent to Greek small letter mu, is defined as belonging to the Common script. Thus, replacing a character with its compatibility equivalent may change the script.
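
The replacement itself is easy to observe in Python, whose standard unicodedata module performs compatibility normalization. The sketch below shows only the character change (micro sign to Greek small letter mu). As far as I know, the module does not expose the Script property, so the change of script from Common to Greek has to be taken from the description above rather than from the program output. The variable names are arbitrary.

```python
import unicodedata

micro = "\u00b5"  # MICRO SIGN, classified as Common according to the text

# NFKC replaces characters by their compatibility equivalents.
folded = unicodedata.normalize("NFKC", micro)

print(f"U+{ord(micro):04X}", unicodedata.name(micro))
print(f"U+{ord(folded):04X}", unicodedata.name(folded))

# The two characters look alike but are distinct code points.
changed = micro != folded
```

A program that classifies text by script should therefore decide explicitly whether it inspects the text before or after such normalization.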

The short (four-letter) and long codes for scripts are summarized in Table 7-1. The table also acts as an overview of writing systems, although it does not include all the historic scripts that have been used. The short code in the first column is the ISO 15924 code, and the second column contains the longer code as defined in UAX #24, using an underline character instead of a space.

Table 7-1. Short and long codes for scripts
Code	Property value alias	Explanations
Arab	Arabic	Used for Arabic, Persian, and other languages
Armn	Armenian	Used for the Armenian language
Bali		Used for Balinese in Indonesia
Batk		Used for Batak languages in Indonesia
Beng	Bengali	Used for Bengali, Assamese, etc.
Blis		Bliss symbols; easy-to-learn pictorial symbols
Bopo	Bopomofo	An alphabetic writing system for Chinese
Brah		Brahmi, an ancient script used in India
Brai	Braille	Braille; symbols touchable by fingertips
Bugi		Buginese, used in Sulawesi, Indonesia
Buhd	Buhid	Used for Buhid in the Philippines (island of Mindoro)
Cans	Canadian_Aboriginal	Unified Canadian Aboriginal Syllabics
Cham		Cham, used in Cambodia and Vietnam
Cher	Cherokee	A syllabic script for the Cherokee language
Cirt		Cirth, a Runic-like script invented by J.R.R. Tolkien
Copt		Coptic; was used for ancient Egyptian, now liturgic
Cprt	Cypriot	An ancient script used in Cyprus
Cyrl	Cyrillic	Cyrillic; used for many Slavic and non-Slavic languages
Cyrs		Cyrillic, Old Church Slavonic variant
Deva	Devanagari	Used for several languages in India, including Hindi
Dsrt	Deseret	Invented in the 1850s (for English), still used by Mormons
Egyd		Egyptian demotic
Egyh		Egyptian hieratic
Egyp		Egyptian hieroglyphs
Ethi	Ethiopic	Used for several languages in Ethiopia
Geok		Khutsuri, a script previously used for Georgian
Geor	Georgian	Used for Georgian (Mkhedruli), spoken in the Caucasus
Glag		Glagolitic (Glagolitsa), an old script for Slavic languages
Goth	Gothic	Was used for a now-extinct Germanic language
Grek	Greek	Greek (both ancient and modern)
Gujr	Gujarati	Used for the Gujarati language in western India
Guru	Gurmukhi	Used for the Panjabi language in northern India
Hang	Hangul	The currently most common script for Korean
Hani	Han	Chinese-Japanese-Korean, known as Hanzi, Kanji, Hanja
Hano	Hanunoo	Used for Hanunóo in the Philippines (island of Mindoro)
Hans		Chinese, Simplified writing system
Hant		Chinese, Traditional writing system
Hebr	Hebrew	Used for Hebrew, Yiddish, Ladino, etc.
Hira	Hiragana	A cursive syllabic script for writing Japanese
Hmng		Pahawh Hmong, used for Hmong in East Asia
Hrkt	Katakana_Or_Hiragana	Alias for Hiragana + Katakana
Hung		Old Hungarian, a Runic system used before AD 1000
Inds		Indus (Harappan); ancient script
Ital	Old_Italic	Ancient Italic (Etruscan, Oscan, etc.)
Java		Javanese, used for the Javanese language in Indonesia
Kali		Kayah Li, used in Burma (Myanmar)
Kana	Katakana	A non-cursive syllabic script for writing Japanese
Khar		Kharoshthi, an ancient script that was used in Asia
Khmr	Khmer	Used for the Cambodian language
Knda	Kannada	Used for Kannada (Kanarese) in southern India
Laoo	Lao	Used for Lao, the main language of Laos
Latf		Latin, Fraktur (Gothic) variant
Latg		Latin, Gaelic variant
Latn	Latin	Used for a wide range of European and other languages
Lepc		Lepcha (Róng), used to write a Tibeto-Burman language
Limb	Limbu	Used for Limbu, a Tibeto-Burman language
Lina		Linear A, an ancient script used on Crete
Linb	Linear_B	Linear B, an ancient script used to write a form of Greek
Mand		Mandaean, used for Mandaic, a Semitic language
Maya		Mayan hieroglyphs
Mero		Meroïtic, used for a now-extinct language in Egypt
Mlym	Malayalam	Used for Malayalam in southern India
Mong	Mongolian	Used for Mongolian; a cursive script, complex shaping
Mymr	Myanmar	Used for Burmese in Burma (Myanmar)
Nkoo		N'Ko, used for Mandekan languages in western Africa
Ogam	Ogham	Was used in the fifth and sixth centuries for early Irish
Orkh		Orkhon, used to write Uyghur, a Turkic language in China
Orya	Oriya	Used for the Oriya language in eastern India
Osma	Osmanya	Used for the Somali language in Africa
Perm		Old Permic (Abur), previously used for the Komi language
Phag		'Phags-pa, was used for Mongolian and other languages
Phnx		Phoenician, an ancient consonantal alphabet
Plrd		Pollard Phonetic, used to write the Miao language in China
Qaaa		Reserved for private use (start)
Qabx		Reserved for private use (end)
Roro		Rongorongo, was used on the Easter Island
Runr	Runic	A historic European script
Sara		Sarati, a "Middle Earth" script invented by J.R.R. Tolkien
Shaw	Shavian	Shavian (Shaw), invented for phonetic writing of English
Sinh	Sinhala	Used for Sinhala (Sinhalese) in Sri Lanka
Sylo		Syloti Nagri, used for Sylheti in Bangladesh and India
Syrc	Syriac	Used for the Syriac language, but also for Arabic
Syre		Syriac (Estrangelo variant)
Syrj		Syriac (Western variant)
Syrn		Syriac (Eastern variant)
Tagb	Tagbanwa	Used for Tagbanwa in the Philippines (island of Palawan)
Tale	Tai_Le	Tai Le (Dehong Dai), used in southwest China
Talu		New Tai Lue, used to write Lue in East Asia
Taml	Tamil	Used for the Tamil language in India, Sri Lanka, etc.
Telu	Telugu	Used for the Telugu language in southern India
Teng		Tengwar, a script invented by J.R.R. Tolkien
Tfng		Tifinagh, used to write Berber languages like Tamasheq
Tglg	Tagalog	Was used to write Tagalog and other Filipino languages
Thaa	Thaana	Thaana, for the Dhivehi languages (in the Maldives)
Thai	Thai	Used for Thai, the main language of Thailand (Siam)
Tibt	Tibetan	Used for Tibetan, spoken in Tibet and Bhutan
Ugar	Ugaritic	An ancient cuneiform script used to write Ugaritic
Vaii		Vai, a syllabary used to write the Vai language in Liberia
Visp		Visible Speech, a phonetic and "organic" script
Xpeo		Old Persian Cuneiform
Xsux		Cuneiform, Sumero-Akkadian
Yiii	Yi	A large syllabary used to write Yi (Lolo) in China
Zxxx		Code for unwritten languages
Zyyy	Common	Code for undetermined script
Zzzz		Code for uncoded script

7.1.5.5. The Script property: the script of a character

The data file that specifies values of the Script (sc) property, i.e., the script of each Unicode character, is http://www.unicode.org/Public/UNIDATA/Scripts.txt. It uses the longer names for the scripts. Its entries look like the following:

0993..09A8    ; Bengali # Lo  [22] BENGALI LETTER O..BENGALI LETTER NA

This sample line says that characters U+0993 through U+09A8 belong to the script "Bengali." Such information is sufficient for automatic classification of characters by script. The rest is a comment, mentioning the general category (Lo), the number of characters in the range (22), and the range expressed by names of characters.
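
A line of this form can be parsed with a short regular expression. The following Python sketch is one possible approach, not a reference implementation. The names LINE_RE and parse_script_line are invented here, and the code assumes that an entry for a single code point simply omits the ".." part, as is usual in Unicode data files.

```python
import re

# One code point or a range "first..last", then ";" and the script name.
LINE_RE = re.compile(
    r"^\s*([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)"
)


def parse_script_line(line):
    """Return (first, last, script) for a data line, else None.

    Everything from "#" onward is a comment and is ignored, so blank
    lines and pure comment lines yield None.
    """
    data = line.split("#", 1)[0]
    match = LINE_RE.match(data)
    if match is None:
        return None
    first = int(match.group(1), 16)
    # A single code point is treated as a range of length one.
    last = int(match.group(2), 16) if match.group(2) else first
    return first, last, match.group(3)


sample = "0993..09A8    ; Bengali # Lo  [22] BENGALI LETTER O..BENGALI LETTER NA"
first, last, script = parse_script_line(sample)
count = last - first + 1
print(f"U+{first:04X}..U+{last:04X} {script} ({count} code points)")
```

The computed size of the range agrees with the count given in the comment part of the sample line.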

For readability, the data in the file has been grouped by script. This lets you see quickly which characters are contained in a given script, but it makes it more difficult to find the script of a given character.
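
One way around this difficulty is to load the ranges once, sort them by starting code point, and then find the script of a character by binary search. The Python sketch below illustrates the idea with a tiny hand-picked list of ranges. Only the Bengali range is taken from the sample line above. The other two ranges, the names RANGES, STARTS, and script_of, and the default label "Unknown" are assumptions made for the illustration.

```python
import bisect

# Illustrative (first, last, script) ranges. A real program would fill
# this list from the data file instead of hard-coding it.
RANGES = sorted([
    (0x0993, 0x09A8, "Bengali"),
    (0x0041, 0x005A, "Latin"),  # capital letters A to Z
    (0x0391, 0x03A1, "Greek"),  # capital letters alpha to rho
])
STARTS = [first for first, _last, _script in RANGES]


def script_of(ch, default="Unknown"):
    """Look up the script of a single character by binary search."""
    cp = ord(ch)
    # Index of the last range that starts at or before cp.
    i = bisect.bisect_right(STARTS, cp) - 1
    if i >= 0 and RANGES[i][0] <= cp <= RANGES[i][1]:
        return RANGES[i][2]
    return default


print(script_of("\u0995"))  # a character inside the Bengali range
print(script_of("A"))
print(script_of("!"))       # not covered by the sample ranges
```

Because the ranges do not overlap, sorting by starting code point is enough, and each lookup takes time logarithmic in the number of ranges.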