7.4. Language Metadata

Metadata is data about data. For example, the string "elf" is text data, and we can associate with it the metadata that the text is in English. This does not change the identity of the characters in the data, but it may affect the interpretation and processing of the data. If accompanied by metadata that says that the string "elf" is in German, the correct interpretation would be that it is a numeral that means "11" (the word is a cognate of English "eleven").

Metadata is normally invisible when it is represented using a digital data format that has provisions for metadata. In plain text, you cannot make a distinction between data and metadata. You can write "This document is in English" if you like, but structurally that would be just part of the text. In markup languages and in data formats used by word processors, metadata can be stored and processed separately.

It is difficult to specify what constitutes a language, but in this context, "language" definitely means a human language, as opposed to computer languages such as programming, command, and data description languages. Even text in a computer language may, to some extent, be characterized as belonging to some human language. For example, for the purposes of speech synthesis, comments and variable names in computer source programs need to be interpreted as belonging to some human language.

7.4.1. Need for Language Information

In data processing, there are several situations where information about the language of text is necessary or useful. Typical examples include spelling and grammar checks, speech synthesis, and limiting searches to documents in a particular language. For example, if you are looking for information about elves and therefore search for documents containing the word "elf," you will not be very happy to see hits where the string "elf" appears as a German word that means "eleven."

Information about the language of text (either a document as a whole, or a larger or smaller part thereof) could in principle be used for the following purposes, but beware that most of the uses are, in most situations, just possibilities rather than reality:

  • Choice of fonts and glyphs (to suit language-specific typographic conventions, including appropriate use of ligatures)

  • Spellchecks

  • Grammar and style checks

  • Restricting searches for texts in particular languages

  • Speech synthesis

  • Presentation of text on Braille devices (as dot patterns, to be read with fingertips), since the methods of such presentation are language-dependent

  • Automatic operations on text, e.g., fixing punctuation to match the rules of a language, showing synonyms or dictionary definitions of a word to the user, or automatically translating words or fragments of text

  • Informing the user about the language (e.g., responding to a user action that corresponds to the question "What's the language of this strange word?")

  • Hyphenation and language-sensitive line breaking in general

Collating (sorting) rules are language-dependent, but they should not depend on the language of the text being sorted. Instead, they should depend on the locale setting (see Chapter 11). For example, the index of a book should be alphabetized according to the rules of the language of the book, not by the rules of the languages of the individual words in the index.

In the Unicode context, the importance of language information is increased by the unification principle (discussed in Chapter 4). Since Unicode, when encoding text, often loses the distinction between variants of a character as used in different languages, it becomes important to be able to indicate the language. This is particularly relevant to East Asian languages. The same string of Unicode characters should be rendered differently depending on whether it is Chinese or Japanese, and this cannot generally be deduced from the characters themselves.

In practice, the user can make the choice of language-dependent presentation "manually" by using a program command or switch. However, this won't work for multilingual documents containing a mixture of Chinese and Japanese, for example. Although such documents are mostly scholarly, they might appear, for example, in user interfaces for language selection as well. This calls for a method for detecting language changes within a document, from markup or otherwise. It needs to be said, though, that often the typographic context dominates. For example, Chinese quotations in Japanese dictionaries usually use Japanese-style characters.

7.4.2. Methods of Determining Language

The language of a document or a part of a document can be determined from:

  • Human user's view of the textual content

  • Automatic analysis of the content, i.e., recognition of the language

  • Internet message headers for the document

  • Language markup, such as the lang attribute in HTML or xml:lang attribute in XML

  • Language tag characters, which are defined in Unicode, but not used much

For example, a speech synthesizer might start reading a document, but then the user realizes that it is being read all wrong because the text is actually in French, and changes the program's mode so that it starts reading by French rules. Some speech synthesizers are able to read different languages, but they usually need to be told which language the text is in.

Automatic analysis is widely applied by search engines like Google and AltaVista. They can search for documents in a particular language, and for this, they need to recognize the language of each document. The methods they use have not been disclosed to the public, but they are probably simple statistical methods. Word processors, too, are often able to recognize the language and select their operating mode (such as the hyphenation algorithm or the spellcheck vocabulary and methods) accordingly. There are even "language guesser" demos and services on the Web. Typically, one line of text is sufficient for guessing the language rather well.

Unfortunately, search engines seem to ignore explicit metadata about language. If Google misanalyzes your Norwegian page as Danish (thereby preventing people from finding it when they restrict the search to pages in Norwegian), there is no simple way to tell Google to reclassify it. It may help to check the spelling of your text and to make sure that there are not too many foreign words (e.g., foreign names) near the start of the document.

Internet message headers are not used much for determining language. The Content-Language header has been defined for indicating the language of the intended audience, and some authoring software generates it. However, "consumers" like browsers do not use it, except in rare cases and inconsistently.

7.4.3. Language Markup

Language markup has been discussed much in different specifications and guides, but it is not widely used in practice yet. It has the obvious drawback that it can only be used in markup systems, not in plain text, and only in markup systems that have been designed with language markup in mind. Moreover, software for processing marked-up text usually makes little or no use of language markup. For example, if Google misanalyzes your Danish web page as being in Norwegian, you cannot fix this by explicitly declaring its language in markup. Yet some programs, such as word processors and web browsers, make some use of language markup.

7.4.3.1. Attributes for language in HTML and XML

In HTML markup, the attribute for indicating language is lang, whereas in XML, it is xml:lang. In XHTML, you can use both. The attributes can be used for practically any markup element for which it could possibly make sense to declare its language. There are also methods for language markup in other data formats, such as XSL, SVG, SMIL, RTF, and DocBook, but here we will concentrate on the common case of HTML and XML.

The value of the attribute is a language code, according to a system that will be explained shortly. Mostly the language code is just a two-letter code, such as "en" for English, "fr" for French, and "de" for German (derived from Deutsch, the name of the language in the language itself).

For example, if you have the tag <html lang="en"> near the start of your HTML document, you are saying that the textual content of your document is in English, except perhaps for inner elements that have their own lang attribute. If the document contains a block of quoted text in French, you can use the markup <blockquote lang="fr">...</blockquote> for it.
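To put these pieces together, the following is a minimal sketch of an HTML document (the title and the text, including the French quotation, are invented for illustration) in which the document as a whole is declared to be in English and one quotation is declared to be in French:

    <html lang="en">
    <head>
      <title>About elves</title>
    </head>
    <body>
      <p>Most of the text in this document is in English.</p>
      <!-- The quotation below is declared to be in French -->
      <blockquote lang="fr">
        <p>Ceci est une citation en fran&ccedil;ais.</p>
      </blockquote>
    </body>
    </html>

In XHTML, you would use both attributes with the same value, e.g., <blockquote lang="fr" xml:lang="fr">, since XML processors recognize xml:lang, whereas older HTML user agents recognize lang.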

There is also a defined way of specifying the language of a document in Dublin Core (DC) metadata; see http://www.dublincore.org. The DC metadata can also be embedded into HTML, e.g., <meta name="DC.Language" content="en">. However, DC metadata is not used much, and it only applies to a document as a whole.

Language markup is in essence logical (descriptive) markup, not prescriptive markup. It simply says, for example, "this is in French," instead of giving any specific processing instructions. Programs may use the information the way they like, or ignore it. A good implementation will use language markup in any operations where language might matter. For example, if a program performs word division or generates speech, it is natural to expect that it uses the information about language given in markup, if available. Yet, it is possible that the program you use can perform language-specific word division or language-sensitive speech generation, yet lack support for French there. You might expect that at least a warning is given, but usually your expectations would not be met.

The working draft "Authoring Techniques for XHTML & HTML Internationalization: Specifying the language of content" at http://www.w3.org/TR/i18n-html-tech-lang/ discusses several problems of language markup and its implications.

7.4.3.2. The impact of language markup

Despite all the potential uses for information expressed in language markup, web browsers mostly ignore it or use it for font selection only. Actual usage includes the following:

  • Several browsers recognize the language of text for the purposes of choosing the font to be used when a document does not specify the font or the text contains a character that is not present in the specified font.

  • Some speech-based browsers recognize some language codes and are able to select the correct reading mode automatically.

  • Some browsers show the language in an element in a pop-up window, if the user requests information about an element (typically, by right clicking on it and selecting a suitable action).

  • Some browsers support language selectors in CSS stylesheets, allowing easier creation of styles that display different languages differently (see the sketch after this list)

  • Some online translator programs, when asked to translate an HTML document, make some use of language markup (especially in the root element, <html lang="...">) to recognize the source language.
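As a sketch of the stylesheet possibility mentioned above (support for the :lang() pseudo-class varies by browser, and the styling choices here are purely illustrative), a stylesheet embedded in an HTML document might distinguish French quotations as follows:

    <style type="text/css">
      /* Render quotations marked as French in italics */
      blockquote:lang(fr) { font-style: italic; }
      /* Use French-style quotation marks around short quotations in French */
      q:lang(fr) { quotes: "« " " »"; }
    </style>

The rules apply to elements whose content language is French, whether the lang attribute appears on the element itself or is inherited from an enclosing element.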

The font selection features imply that it is generally not useful to use language markup for transliterated or transcribed text. Logically, a name like Dostoyevsky remains a Russian word when transliterated this way (or in some other way). Yet, if you use language markup like lang="ru" for it, browsers may display it in a font different from the normal font of the text, since they use a font assigned to Russian text. This could make the name stand out in a distracting way.

7.4.3.3. Granularity of markup

Language markup is very easy in simple cases. You just add an attribute to the tag for the root element of a document (in HTML, the <html> tag). If you have quotations in another language, you add language markup for them. The same applies to names of books and other longish fragments of text. However, as you get down to the level of individual words, what should you do with words like "status quo" (that's Latin, isn't it?) or "fiancé" (French, even if used in English text?) or with proper names of people and things? For example, the Web Accessibility Initiative (WAI) recommendations say that you should indicate all changes of language in a document, and this is a Priority 1 requirement. Yet, the WAI documents themselves don't do that for proper names.

Thus, language markup is easy for large portions of text and doesn't take much time, but in such cases, programs could well deduce the language by heuristic methods. Using language markup for very small fragments of text, like words and even parts of a word, would take much time and markup. Yet, it would be essential for detecting changes in language, since a program can hardly deduce from a lone word that it is in a language different from that of the surrounding text. If a document in English mentions that the French word for "garlic" is "ail," it is unrealistic to expect that programs will recognize this (without any markup) and treat that "ail" as a French word and not an English word.

Somewhat similarly, it might be impossible to deduce from a medium-size piece of text whether it is supposed to be U.S. English or British English. The text might contain spellings like "colour" and "favor," but how could a spellchecker know which one is right and which one is misspelled? The language would need to be expressed in markup using a more specific code than "en" (which indicates English in general), namely "en-US" for U.S. English or "en-GB" for British English. Although this would be easy if the author of a document knows it, you would need to add extra markup if your document in U.S. English quotes British authors, or vice versa, and most writers hardly think that they need to indicate the language of quoted text if they quote text in English in a document in English.
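For example, a U.S. English document that quotes a British source could use markup like the following (a sketch; the sentence is invented) so that a spellchecker can treat each spelling as correct in its own context:

    <p lang="en-US">The committee discussed the colors of the flag,
    although the original report spoke of
    <q lang="en-GB">the colours of the flag</q>.</p>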

The paradox of language markup: it's easy when it's not needed.

Taken to the extremes, or applied logically, language markup would apply even to parts of words in many cases. After all, if you take, say, an English word and use it in a language that uses suffixes for inflexion, the suffix and the base word logically belong to different languages. For example, "Smithin," the genitive form of "Smith" when used in Finnish, would be marked up as <span lang="en">Smith</span>in inside a document in Finnish. This would be awkward to do even with good authoring tools, and it could in practice make things worse. A speech synthesizer, for example, might pause between the base word and the suffix, when it switches mode.

There are many other problems in using detailed language markup. Thus, it is best to limit it to major parts of a document only, such as expressions longer than a few words.

7.4.4. Language Codes

In order to express the language of some text in a machine-processable way, we need a system of language codes. Preferably, the codes should be easy to recognize in a program, but most importantly, they need to be systematic. We cannot really work with information about language expressed in everyone's own style and language, like "English," "anglais," or "engl."

7.4.4.1. The confusion of codes

Just as there is a confusion of languages in the world, there is a confusion of language codes. Several incompatible systems are used to encode information about language in a short identifier, typically a two- or three-letter alphabetic code or a number. To some extent, the codes can be mapped to each other. However, there is no universally accepted list of languages, or anything close to that. Language code systems in use include:

  • The ISO 639 standard (see below), with two- and three-letter alphabetic codes as well as numeric codes

  • The Ethnologue system, also known as SIL code, with three-letter codes; see http://www.ethnologue.com

  • MARC Code, used in libraries; see http://www.loc.gov/marc/languages/

  • Systems used in various computing environments; see the draft list "Language Codes: ISO 639, Microsoft and Macintosh," available at http://www.unicode.org/unicode/onlinedat/languages.html, and the "List of Windows XP's Three Letter Acronyms for Languages," found at http://www.microsoft.com/globaldev/reference/winxp/langtla.mspx

The definitions of language code systems typically identify a language by its name in English (and perhaps in French, too). However, the same name might be used for different languages in different code systems. One code's language might be another code's dialect, or another code's group of languages. There isn't even a universally approved operative definition of what constitutes a language in principle. The oft-quoted statement "a language is a dialect with an army and a navy" (which exists in different variants, e.g., requiring an air force as well) might describe some of the social and political aspects involved, but it isn't really a serious definition.

7.4.4.2. ISO 639

Frustrating as the confusion might be, there is luckily some uniformity in those language codes that are relevant at the character level. Such codes are generally based on the ISO 639 family of standards, often augmented by additional definitions and principles given in RFC documents about the use of language codes on the Internet.

ISO 639, titled "Codes for the representation of languages," currently has two parts. ISO 639-1 defines two-letter codes for a relatively small set of languages, and ISO 639-2 defines three-letter codes for the same languages and many additional languages. There is, however, work in progress to extend the standard with new parts, as shown in Table 7-4. In particular, ISO 639-3 is meant to cover all languages of the world, which means thousands of languages as opposed to the hundreds of languages in ISO 639-2. It is expected to be largely based on Ethnologue codes for languages that have not yet been covered by existing ISO 639 codes.

Table 7-4. Parts of ISO 639, current and planned

Part        Content                                                Notes
ISO 639-1   Alpha-2 code                                           Example: "en"
ISO 639-2   Alpha-3 code                                           Example: "eng"
ISO 639-3   Alpha-3 code for comprehensive coverage of languages   Planned, 2006?
ISO 639-4   Implementation guidelines and general principles       Planned, 2007?
ISO 639-5   Alpha-3 code for language families and groups          Planned, 2006?


For 22 languages, ISO 639-2 defines two three-letter codes, bibliographic (ISO 639-2/B) and terminological (ISO 639-2/T), such as "fre" and "fra" for French. In practice, this does not matter much, since these languages also have two-letter codes (such as "fr") defined in ISO 639-1. Policies on language codes on the Internet favor ISO 639-1 codes.

In ISO 639-2, alpha-3 codes from "qaa" to "qtz" have been reserved for local use. Thus, they will not be assigned to languages in a standard, and they can be used for special purposes by agreements between interested parties.

The registration authority for ISO 639-2 is the U.S. Library of Congress, and the up-to-date list of codes is at http://www.loc.gov/standards/iso639-2/. The list contains the ISO 639-1 codes as well. Some widely used ISO 639-1 codes are listed in Table 7-5.

Table 7-5. ISO 639-1 codes for some languages

Language     Code   Comments
Afrikaans    af     Spoken in South Africa
Arabic       ar     Exists in several forms that differ substantially by country
Chinese      zh     Much variation by dialect and writing, e.g., zh-Hant and zh-Hans
Dutch        nl     Spoken in the Netherlands, in Belgium, etc.
English      en     Difference between en-US and en-GB relevant in spelling
Esperanto    eo     The most widely used constructed (artificial) human language
French       fr     Some variation exists, e.g., between fr-FR and fr-CA (Canadian)
German       de     Orthographic differences exist between language forms
Greek        el     Modern Greek (Ancient Greek has three-letter code "grc")
Hebrew       he     Written in Hebrew script
Hindi        hi     Spoken in India
Italian      it     Spoken in Italy
Japanese     ja     Written using different scripts
Korean       ko     Currently mostly written in specific Korean script, Hangul
Latin        la     Used for ancient, medieval, and modern Latin
Polish       pl     Spoken in Poland; a Slavic language written in the Latin script
Portuguese   pt     Orthographic differences between pt-PT and pt-BR (Brazilian)
Russian      ru     Written in the Cyrillic script
Spanish      es     Spoken in Spain, Latin America, and elsewhere
Vietnamese   vi     Currently mostly written in Latin letters, with many diacritics


7.4.4.3. Language codes on the Internet

In 1995, RFC 1766 was issued under the title "Tags for the Identification of Languages." Here "tag" really means "code." The idea was to specify that an ISO 639 conformant language code is used as the primary code, optionally followed by a hyphen and a subcode, which is usually a two-letter country code as defined in ISO 3166.

ISO 3166 defines code systems for countries and some other territories. Among the systems, the two-letter alphabetic code (e.g., "FR" for France) is most widely used. Usually, but not always, this code coincides with the code used in the two-letter code of the Internet domain of the country (e.g., ".fr").

Both language codes and country codes are case-insensitive. However, the recommendation is to write language codes in lowercase, country codes in uppercase. For example, the language code for Italian is usually written as "it," whereas the country code for Italy is written "IT." As in this example, a language code is often the same as the country code for a country where the language is common. There are many exceptions, though. For example, Chinese is "zh" but China is "CN."

Thus, for example, "en-US" means English as spoken in the U.S., and "en-GB" means British English, or English as spoken in the United Kingdom of Great Britain and Northern Ireland, commonly known as the U.K. Note that the ISO 3166 country code is "GB," while the Internet domain for the U.K. is ".uk."

Although some primary language codes are the same as country codes, the two code systems are separate. In general, there is no one-to-one mapping between languages and countries.


Several Internet protocols refer to RFC 1766, but the references should probably be interpreted as referring to the newest definition of language codes. In 2001, RFC 1766 was superseded by RFC 3066. There is work in progress to create the successor of RFC 3066; see http://www.w3.org/International/core/langtags/rfc3066bis.html.

The general structure of language codes according to RFC 3066 is the following: a language code consists of a primary code ("primary subtag") and optionally one or more additional codes ("subtags"), each preceded by a hyphen-minus character "-". In practice, an underscore is often used as a separator instead of a hyphen-minus, e.g., "en_US", since in many contexts, the syntax of codes does not allow a hyphen-minus. The principles on primary language codes according to RFC 3066 are the following:

  • Any two-letter primary code shall be as defined in ISO 639-1.

  • Any three-letter primary code shall be as defined in ISO 639-2. Such codes must not be used for languages that have a two-letter code (e.g., "eng" is not allowed, since English has the ISO 639-1 code "en").

  • The primary code "i" is reserved for language codes registered at the Internet Assigned Numbers Authority (IANA). Such registrations have not been made much. Codes so registered should not be used, if an ISO based code is available.

  • The primary code "x" can be used by agreements between interested parties.

  • No other primary code shall be used.

The rules for the secondary code ("second subtag") in a language code are:

  • No one-letter code shall be used.

  • Any two-letter code shall be a country or other territory code as defined in ISO 3166.

  • A code of length three to eight may be registered at IANA. It may indicate a dialect or other variant. The registry is at http://www.iana.org/assignments/language-tags.

  • Codes longer than eight characters should not be used.

In practice, only a few combinations of a primary code and a secondary code have practical significance at present. Although the structure of language codes permits more complicated codes, such as de-AT-1996 (Austrian variant of German, orthography as reformed in 1996), they have even less use. However, any software that processes language codes should be prepared to parse a structured code, instead of just performing simple string matching against primary codes like "en," "fr," etc.

This work on the development of language codes as used on the Internet will probably result in some additional specific rules on the use of additional codes. In particular, several additional codes could be used according to the following principles:

  • The additional codes, if present, would appear in the order language-script-region-variant-extension-privateuse.

  • Additional codes may be omitted, and mostly the lengths of codes resolve any ambiguities. For example, "en-US" has language and region only, with the script omitted (implied).

  • The script can be indicated by using a four-letter code. However, it should be omitted (implied) for languages that are normally written in one script only. There will be a registry of such cases. For example, "en" implies the Latin script, "Latn," and the code "en-Latn" should not be used. On the other hand, it would be adequate to use "ru-Latn" for transliterated Russian text, though this would still be vague, since you would not be able to express the transliteration scheme used (see the markup sketch after this list).

  • The region code can be a two-letter country or territory code (as by ISO 3166) or a three-digit code, to be interpreted according to an IANA registry that contains a subset of numeric codes for areas (such as continents) according to a system developed by the United Nations.

  • Variant codes can be used for well-recognized variants of a language, such as dialects. They are at least five characters long if they start with a letter and at least four characters long if they start with a digit. For example, in "de-1996" the code "1996" identifies the orthographic variant of German defined by the reform in 1996. Specifying this variant, or the "1901" variant (referring to the older orthography), for German can be essential to having spellchecks performed as intended. (At present, you use the settings of a word processor, applicable to the entire document, to select between such orthographic variants, if it supports them at all.)

  • Extension codes are application-oriented and start with a code consisting of a single letter. A registry is to be set up for extension codes.

  • Private use codes indicate distinctions in language important in a given context by private agreement and they start with a code consisting of the letter "x." Thus, in the code "en-GB-a-some-stuff-x-foobar," "a-some-stuff" is an extension part, and "x-foobar" is a private use part.
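To illustrate how such structured codes might appear in markup (a sketch only; the text fragments are invented, and whether any given program makes use of such codes is another matter), consider the following:

    <!-- Russian language, Latin (transliterated) script -->
    <span lang="ru-Latn">Dostoyevsky</span>

    <!-- German text following the 1996 orthography reform -->
    <p lang="de-1996">Dieser Absatz folgt der reformierten Rechtschreibung.</p>

    <!-- A private-use distinction, meaningful only by agreement between interested parties -->
    <span lang="en-GB-x-foobar">text classified in some private way</span>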

7.4.4.4. Language codes and user interfaces

Language codes are based on names of languages, although often on the English name rather than the name in the language itself. When presented to users, language codes should preferably be mapped to localized language names, i.e., names in the language that the user prefers. For such purposes, the CLDR database (discussed in Chapter 11) contains localized names for languages.

In practice, user interfaces like language selection menus often identify the languages either by English names or by two-letter ISO 639 codes, or both (as in Figure 7-2 or on the main page of the European Union web site http://www.eu.int/). Short codes are used especially in contexts where several languages need to be expressed compactly. Sometimes flags of countries are used, raising many objections. For example, on the page http://www.google.com/language_tools, flags are used adequately to indicate countries, whereas the choice of language is by language name.

The most logical method for selection between versions of documentation in different languages (for example, in a document that acts as an entry page only) would be to use the name of each language in the language itself. Of course, this often requires a rich repertoire of characters. It also raises the problem that people get confused with the mixture of languages, especially if they see "strange characters" and cannot easily figure out what the information is about. Ordering the languages is difficult too; often they are ordered by the ISO 639 code.

7.4.5. Language Tags in Unicode

There are special characters for language tagging in Unicode, but their use is strongly discouraged, in general. Language tag characters are control characters that contain metadata about text. They are invisible, although they may indirectly affect the rendering of normal characters. They are meant for use in plain text (as opposed to HTML or XML, for example) and in special circumstances only.

The block Tags, U+E0000..U+E007F, is used for the purpose. It contains clones of ASCII characters, defined as invisible tag characters and used to indicate language using language codes such as "en" or "en-US." A language tag is introduced by the character U+E0001 LANGUAGE TAG, followed by the tag characters that spell the code. For example, to indicate that subsequent text is in English, you would use U+E0001 followed by U+E0065 and U+E006E (clones of "e" and "n"). Any software that does not recognize language tag characters probably behaves oddly upon encountering them, e.g., trying to render them visibly, instead of just ignoring them.

There is a free utility LTag for constructing language tags, to be used with plain text editors in Windows. It is available from http://users.adelphia.net/~dewell/ltag.html.


