7.4. Language Metadata

Metadata is data about data. For example, the string "elf" is text data, and we can associate with it the metadata that the text is in English. This does not change the identity of characters in the data, but it may affect the interpretation and processing of the data. If accompanied by metadata that says that the string "elf" is in German, the correct interpretation would be that it is a numeral that means "11" (the word is a cognate of English "eleven"). Metadata is normally invisible when represented using a digital data format that has provisions for metadata. In plain text, you cannot make a distinction between data and metadata. You can write "This document is in English" if you like, but structurally that would be just part of the text. In markup languages and in data formats used by word processors, metadata can be stored and processed separately. It is difficult to specify what constitutes a language, but in this context, "language" definitely means "human language" as opposed to computer languages, such as programming, command, and data description languages. Text in a computer language may be characterized as belonging to some human language, to some extent. For example, for the purposes of speech synthesis, comments and variable names in computer source programs need to be interpreted as belonging to some human language.

7.4.1. Need for Language Information

In data processing, there are several situations where information about the language of text is necessary or useful. Typical examples include spelling and grammar checks, speech synthesis, and limiting searches to documents in a particular language. For example, if you are looking for information about elves and therefore search for documents containing the word "elf," you will not be very happy to see hits where the string "elf" appears as a German word that means "eleven."
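To make the "elf" problem concrete, here is a toy sketch of statistical language guessing by character-trigram overlap, the kind of simple method discussed later in this section. The function names and the tiny training samples are invented for illustration; real systems build profiles from large corpora:

```python
from collections import Counter

def trigrams(text):
    """Character trigrams of a lowercased text, with their counts."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

# Tiny training samples, far too small for real use; real systems
# build per-language profiles from large corpora.
profiles = {
    "en": trigrams("the quick brown fox jumps over the lazy dog and then the elf"),
    "de": trigrams("der schnelle braune fuchs springt ueber den faulen hund und elf"),
}

def guess_language(text):
    """Return the language whose profile overlaps the text's trigrams most."""
    sample = trigrams(text)
    def score(lang):
        # Counter & Counter keeps the minimum count of shared trigrams.
        return sum((sample & profiles[lang]).values())
    return max(profiles, key=score)
```

With these toy profiles, guess_language("the dog and the fox") picks English, while guess_language("der hund und der fuchs") picks German, showing how even crude statistics separate closely related languages.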
Information about the language of text (either a document as a whole, or a larger or smaller part thereof) could in principle be used for the following purposes, but beware that most of the uses are, in most situations, just possibilities rather than reality:
Collating (sorting) rules are an exception: they should not depend on the language of the text being sorted. Instead, they should depend on the locale setting (see Chapter 11). For example, the index of a book should be alphabetized according to the rules of the language of the book, not by the rules of the languages of the words in the index. In the Unicode context, the importance of language information is increased by the unification principle (discussed in Chapter 4). Since Unicode, when encoding text, often loses the distinction between variants of a character as used in different languages, it becomes important to be able to indicate the language. This is particularly relevant to East Asian languages. The same string of Unicode characters should be rendered differently depending on whether it is Chinese or Japanese, and this cannot generally be deduced from the characters themselves. In practice, the user can make the choice of language-dependent presentation "manually" by using a program command or switch. However, this won't work for multilingual documents containing a mixture of Chinese and Japanese, for example. Although such documents are mostly scholarly, they might appear, for example, in user interfaces for language selection as well. This calls for a method for detecting language changes within a document, from markup or otherwise. It needs to be said, though, that often the typographic context dominates. For example, Chinese quotations in Japanese dictionaries usually use Japanese-style characters.

7.4.2. Methods of Determining Language

The language of a document or a part of a document can be determined from:
For example, a speech synthesizer might start reading a document, but then the user realizes that it's all wrong, and he changes the program's mode so that it starts reading by French rules. Some speech synthesizers are able to read different languages, but they usually need to be told which language the text is in. Automatic analysis is widely applied by search engines like Google and AltaVista. They can search for documents in a particular language, and for this, they need to recognize the language of each document. The methods they use have not been disclosed to the public, but they are probably simple statistical methods. Word processors, too, are often able to recognize the language and select their operating mode, such as hyphenation algorithm or spellcheck vocabularies and methods, accordingly. There are even "language guesser" demos and services on the Web. Typically, one line of text is sufficient for guessing the language rather well. Unfortunately, search engines seem to be immune to explicit metadata about language. If Google misanalyzes your Norwegian page as Danish (thereby preventing people from finding it when they restrict the search to pages in Norwegian), there is no simple way to tell Google to reclassify it. It may help to check the spelling of your text and to make sure that there are not too many foreign words (e.g., foreign names) near the start of the document. Internet message headers are not used much for determining language. The Content-Language header has been defined for indicating the language of the intended audience, and some authoring software generates it. However, "consumers" like browsers do not use it, except in rare cases and inconsistently.

7.4.3. Language Markup

Language markup has been discussed much in different specifications and guides, but it is not yet widely used in practice.
It has the obvious drawback that it can only be used in markup systems, not in plain text, and only in markup systems that have been designed with language markup in mind. Moreover, software for processing marked-up text usually makes little or no use of language markup. For example, if Google misanalyzes your Danish web page as being in Norwegian, you cannot fix this by explicitly declaring its language in markup. Yet some programs, such as word processors and web browsers, make some use of language markup.

7.4.3.1. Attributes for language in HTML and XML

In HTML markup, the attribute for indicating language is lang, whereas in XML, it is xml:lang. In XHTML, you can use both. The attributes can be used for practically any markup element for which it could possibly make sense to declare its language. There are also methods for language markup in other data formats, such as XSL, SVG, SMIL, RTF, and DocBook, but here we will concentrate on the common case of HTML and XML. The value of the attribute is a language code, according to a system that will be explained shortly. Mostly the language code is just a two-letter code, such as "en" for English, "fr" for French, and "de" for German (derived from Deutsch, the name of the language in the language itself). For example, if you have the tag <html lang="en"> near the start of your HTML document, you are saying that the textual content of your document is in English, except perhaps for inner elements that have their own lang attribute. If the document contains a block of quoted text in French, you can use the markup <blockquote lang="fr">...</blockquote> for it. There is also a defined way of specifying the language of a document in Dublin Core (DC) metadata; see http://www.dublincore.org. The DC metadata can also be embedded into HTML, e.g., <meta name="DC.Language" content="en">. However, DC metadata is not used much, and it only applies to a document as a whole.
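The inheritance behavior of the lang attribute (an element's language applies to everything inside it unless an inner element overrides it) can be illustrated with a small sketch using Python's standard html.parser module. The class and its structure are invented for illustration:

```python
from html.parser import HTMLParser

class LangTracker(HTMLParser):
    """Record each text fragment with its effective language, i.e., the
    lang attribute of the nearest enclosing element that has one."""
    def __init__(self):
        super().__init__()
        self.stack = []       # (tag, language in effect) for open elements
        self.fragments = []   # (language, text) pairs found in the document

    def handle_starttag(self, tag, attrs):
        # Inherit the language of the enclosing element unless overridden.
        inherited = self.stack[-1][1] if self.stack else None
        self.stack.append((tag, dict(attrs).get("lang", inherited)))

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            lang = self.stack[-1][1] if self.stack else None
            self.fragments.append((lang, text))

tracker = LangTracker()
tracker.feed('<html lang="en"><body><p>Hello.</p>'
             '<blockquote lang="fr"><p>Bonjour.</p></blockquote>'
             '</body></html>')
# tracker.fragments now pairs "Hello." with "en" and "Bonjour." with "fr".
```

This is exactly the bookkeeping a speech synthesizer or spellchecker would need in order to switch modes at the <blockquote lang="fr"> boundary.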
Language markup is in essence logical (descriptive), not prescriptive, markup. It simply says, for example, "this is in French," instead of giving any specific processing instructions. Programs may use the information the way they like, or ignore it. A good implementation will use language markup in any operations where language might matter. For example, if a program performs word division or generates speech, it is natural to expect that it uses the information about language given in markup, if available. It is quite possible, though, that the program you use can perform language-specific word division or language-sensitive speech generation, yet lacks support for French. You might expect that at least a warning is given, but usually your expectations would not be met. The working draft "Authoring Techniques for XHTML & HTML Internationalization: Specifying the language of content" at http://www.w3.org/TR/i18n-html-tech-lang/ discusses several problems of language markup and its implications.

7.4.3.2. The impact of language markup

Despite all the potential uses for information expressed in language markup, web browsers mostly ignore it or use it for font selection only. Actual usage includes the following:
The font selection features imply that it is generally not useful to use language markup for transliterated or transcribed text. Logically, the name remains a Russian word if transliterated as Dostoyevsky (or in some other way). Yet, if you use language markup like lang="ru" for it, browsers may display it in a font different from the normal font of the text, since they use a font assigned to Russian text. This could make the name stand out in a distracting way.

7.4.3.3. Granularity of markup

Language markup is very easy in simple cases. You just add an attribute to the tag for the root element of a document (in HTML, the <html> tag). If you have quotations in another language, you add language markup for them. The same applies to names of books and other longish fragments of text. However, as you get down to the level of individual words, what should you do with words like "status quo" (that's Latin, isn't it?) or "fiancé" (French, even if used in English text?) or with proper names of people and things? For example, the Web Accessibility Initiative (WAI) recommendations say that you should indicate all changes of language in a document, and this is a Priority 1 requirement. Yet, the WAI documents themselves don't do that for proper names. Thus, language markup is easy for large portions of text and doesn't take much time, but in such cases, programs could well deduce the language by heuristic methods. Using language markup for very small fragments of text, like words and even parts of a word, would take much time and markup. Yet, it would be essential for detecting changes in language, since a program can hardly deduce from a lone word that it is in a language different from that of the surrounding text. If a document in English mentions that the French word for "garlic" is "ail," it is unrealistic to expect that programs will recognize this (without any markup) and treat that "ail" as a French word and not an English word.
Somewhat similarly, it might be impossible to deduce from a medium-size piece of text whether it is supposed to be U.S. English or British English. The text might contain spellings like "colour" and "favor," but how could a spellchecker know which one is right and which one is misspelled? The language would need to be expressed in markup using a more specific code than "en" (which indicates English in general), namely "en-US" for U.S. English or "en-GB" for British English. Although this would be easy if the author of a document knows it, you would need to add extra markup if your document in U.S. English quotes British authors, or vice versa, and most writers hardly think that they need to indicate the language of quoted text if they quote text in English in a document in English. The paradox of language markup: it's easy when it's not needed. Taken to extremes, or applied logically, language markup would apply even to parts of words in many cases. After all, if you take, say, an English word and use it in a language that uses suffixes for inflexion, the suffix and the base word logically belong to different languages. For example, "Smithin," the genitive form of "Smith" when used in Finnish, would be marked up as <span lang="en">Smith</span>in inside a document in Finnish. This would be awkward to do even with good authoring tools, and it could in practice make things worse. A speech synthesizer, for example, might pause between the base word and the suffix, when it switches mode. There are many other problems in using detailed language markup. Thus, it is best to limit it to major parts of a document only, such as expressions longer than a few words.

7.4.4. Language Codes

In order to express the language of some text in a machine-processable way, we need a system of language codes. Preferably, the codes should be easy to recognize in a program, but most importantly, they need to be systematic.
We cannot really work with information about language expressed in everyone's own style and language, like "English," "anglais," or "engl."

7.4.4.1. The confusion of codes

Just as there is a confusion of languages in the world, there is a confusion of language codes. Several incompatible systems are used to encode information about language in a short identifier, typically a two- or three-letter alphabetic code or a number. To some extent, the codes can be mapped to each other. However, there is no universally accepted list of languages, or anything close to that. Language code systems in use include:
The definitions of language code systems typically identify a language by its name in English (and perhaps in French, too). However, the same name might be used about different languages in different code systems. One code's language might be another code's dialect, or another code's group of languages. There isn't even a universally approved operative definition of what constitutes a language in principle. The oft-quoted statement "a language is a dialect with an army and a navy" (which exists in different variants, e.g., requiring an air force as well) might describe some of the social and political aspects involved, but it isn't really a serious definition.

7.4.4.2. ISO 639

Frustrating as the confusion might be, there is luckily some uniformity in those language codes that are relevant at the character level. Such codes are generally based on the ISO 639 family of standards, often augmented by additional definitions and principles given in RFC documents about the use of language codes on the Internet. ISO 639, titled "Codes for the representation of languages," currently has two parts. ISO 639-1 defines two-letter codes for a relatively small set of languages, and ISO 639-2 defines three-letter codes for the same languages and many additional languages. There is, however, work in progress to extend the standard with new parts, as shown in Table 7-4. In particular, ISO 639-3 is meant to cover all languages of the world, which means thousands of languages as opposed to the hundreds of languages in ISO 639-2. This is expected to be largely based on Ethnologue codes, for languages that have not yet been covered by existing ISO 639 codes.
For 22 languages, ISO 639-2 defines two three-letter codes, bibliographic (ISO 639-2/B) and terminological (ISO 639-2/T), such as "fre" and "fra" for French. In practice, this does not matter much, since these languages also have two-letter codes (such as "fr") defined in ISO 639-1, and policies on language codes on the Internet favor ISO 639-1 codes. In ISO 639-2, alpha-3 codes from "qaa" to "qtx" have been reserved for local use. Thus, they will not be assigned to languages in a standard, and they can be used for special purposes by agreements between interested parties. The registration authority for ISO 639-2 is the U.S. Library of Congress, and the up-to-date list of codes is at http://www.loc.gov/standards/iso639-2/. The list contains the ISO 639-1 codes as well. Some widely used ISO 639-1 codes are listed in Table 7-5.
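As a small illustration of these rules, the following Python sketch records a few of the real dual B/T code pairs and checks whether a three-letter code falls in the local-use range; the names DUAL_CODES and is_local_use are invented for this example:

```python
# A few of the 22 languages with distinct bibliographic (B) and
# terminological (T) codes in ISO 639-2, keyed by the ISO 639-1 code.
DUAL_CODES = {
    "fr": ("fre", "fra"),   # French
    "de": ("ger", "deu"),   # German
    "cs": ("cze", "ces"),   # Czech
}

def is_local_use(code):
    """True for ISO 639-2 alpha-3 codes reserved for local use,
    i.e., codes in the range "qaa" through "qtx"."""
    c = code.lower()
    # Lexicographic comparison works because all codes in the range
    # are three lowercase letters.
    return len(c) == 3 and c.isalpha() and "qaa" <= c <= "qtx"
```

For instance, is_local_use("qab") is True, so "qab" may be used by private agreement, whereas is_local_use("fre") is False, since "fre" is a standard-assigned code.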
7.4.4.3. Language codes on the Internet

In 1995, RFC 1766 was issued under the title "Tags for the Identification of Languages." Here "tag" really means "code." The idea was to specify that an ISO 639 conformant language code is used as the primary code, optionally followed by a hyphen and a subcode, which is usually a two-letter country code as defined in ISO 3166. ISO 3166 defines code systems for countries and some other territories. Among the systems, the two-letter alphabetic code (e.g., "FR" for France) is most widely used. Usually, but not always, this code coincides with the code used in the two-letter code of the Internet domain of the country (e.g., ".fr"). Both language codes and country codes are case-insensitive. However, the recommendation is to write language codes in lowercase and country codes in uppercase. For example, the language code for Italian is usually written as "it," whereas the country code for Italy is written "IT." As in this example, a language code is often the same as the country code for a country where the language is common. There are many exceptions, though. For example, Chinese is "zh" but China is "CN." Thus, for example, "en-US" means English as spoken in the U.S., and "en-GB" means British English, or English as spoken in the United Kingdom of Great Britain and Northern Ireland, commonly known as the U.K. Note that the ISO 3166 country code is "GB," while the Internet domain for the U.K. is ".uk."
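The case conventions just described, together with the hyphen-separated subtag structure (and the underline separator often seen in practice), can be captured in a small normalizer. This is an illustrative sketch, not a complete parser; it simply treats any two-letter subtag after the first as a country code:

```python
def parse_language_code(code):
    """Split a language code into subtags and normalize case by
    convention: primary subtag lowercase, two-letter country subtags
    uppercase. Accepts "_" as a separator, as often seen in practice."""
    subtags = code.replace("_", "-").split("-")
    result = [subtags[0].lower()]          # primary subtag, e.g., "en"
    for sub in subtags[1:]:
        if len(sub) == 2 and sub.isalpha():
            result.append(sub.upper())     # country code, e.g., "US"
        else:
            result.append(sub.lower())     # other subtag, e.g., "1996"
    return result
```

For example, parse_language_code("en_us") yields ["en", "US"], and parse_language_code("de-at-1996") yields ["de", "AT", "1996"], illustrating why software should parse the structure rather than match whole strings.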
Several Internet protocols refer to RFC 1766, but the references should probably be interpreted as referring to the newest definition of language codes. In 2001, RFC 1766 was superseded by RFC 3066. There is work in progress to create the successor of RFC 3066; see http://www.w3.org/International/core/langtags/rfc3066bis.html. The general structure of language codes according to RFC 3066 is the following: a language code consists of a primary code ("primary subtag") and optionally one or more additional codes ("subtags"), each preceded by a hyphen-minus character "-". In practice, an underline is often used as a separator instead of the hyphen-minus, e.g., "en_US", since in many contexts, the syntax of codes does not allow a hyphen-minus. The principles on primary language codes according to RFC 3066 are the following:
The rules for the secondary code ("second subtag") in a language code are:
In practice, only a few combinations of a primary code and a secondary code have practical significance at present. Although the structure of language codes permits more complicated codes, such as de-AT-1996 (Austrian variant of German, orthography as reformed in 1996), they have even less use. However, any software that processes language codes should be prepared to parse a structured code, instead of just performing simple string matching against primary codes like "en," "fr," etc. This work on the development of language codes as used on the Internet will probably result in some additional specific rules on the use of additional codes. In particular, several additional codes could be used according to the following principles:
7.4.4.4. Language codes and user interfaces

Language codes are based on names of languages, although often on the English name rather than the name in the language itself. When presented to users, language codes should preferably be mapped to localized language names, i.e., names in the language that the user prefers. For such purposes, the CLDR database (discussed in Chapter 11) contains localized names for languages. In practice, user interfaces like language selection menus often identify the languages either by English names or by two-letter ISO 639 codes, or both (as in Figure 7-2 or on the main page of the European Union web site http://www.eu.int/). Short codes are used especially in contexts where several languages need to be expressed compactly. Sometimes flags of countries are used, raising many objections. For example, on the page http://www.google.com/language_tools, flags are used adequately to indicate countries, whereas the choice of language is by language name. The most logical method for selection between versions of documentation in different languages (for example, in a document that acts as an entry page only) would be to use the name of each language in the language itself. Of course, this often requires a rich repertoire of characters. It also raises the problem that people get confused with the mixture of languages, especially if they see "strange characters" and cannot easily figure out what the information is about. Ordering the languages is difficult too; often they are ordered by the ISO 639 code.

7.4.5. Language Tags in Unicode

There are special characters for language tagging in Unicode, but their use is strongly discouraged, in general. Language tag characters are control characters that contain metadata about text. They are invisible, although they may indirectly affect the rendering of normal characters. They are meant for use in plain text (as opposed to HTML or XML, for example) and in special circumstances only.
The block Tags, U+E0000..U+E007F, is used for the purpose. It contains clones of ASCII characters, defined as invisible tag characters and used to indicate language using language codes such as "en" or "en-US." For example, to indicate that subsequent text is in English, you would use the two characters U+E0065 U+E006E (clones of "e" and "n"). Any software that does not recognize language tag characters probably behaves oddly upon encountering them, e.g., trying to render them visibly, instead of just ignoring them. There is a free utility, LTag, for constructing language tags, to be used with plain text editors in Windows. It is available from http://users.adelphia.net/~dewell/ltag.html.
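As a sketch of the mechanism: each tag character is the ASCII character's code point plus 0xE0000, and Unicode defines U+E0001 LANGUAGE TAG as the character that introduces a language tag. The Python code below builds such a tag and also strips tag characters, which is effectively what software that does not support them should do. The function names are invented for illustration:

```python
TAG_BASE = 0xE0000  # start of the Tags block, U+E0000..U+E007F

def language_tag(code):
    """Build a plain-text language tag: U+E0001 LANGUAGE TAG followed
    by tag-character clones of the ASCII characters of the code."""
    return "\U000E0001" + "".join(chr(TAG_BASE + ord(c)) for c in code)

def strip_tags(text):
    """Remove all characters of the Tags block, so that tagged text
    degrades gracefully to the plain text it annotates."""
    return "".join(c for c in text if not 0xE0000 <= ord(c) <= 0xE007F)
```

For example, language_tag("en") yields U+E0001 U+E0065 U+E006E, and strip_tags applied to tagged text returns just the visible characters.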