6.3. Language Features

Coordinating character sets is only the first part of the challenge. Even languages that share a character set may have different rules for hyphenation, spacing, quotation marks, punctuation, and so on. In addition to character shapes (glyphs), issues such as directionality (whether the text reads left to right or right to left) and cursive joining behavior must be taken into account. This section introduces the features in HTML 4.01 and XHTML 1.0 and higher that address the needs of a multilingual Web.

6.3.1. Language Specification

Authors are strongly urged to specify the language for all HTML and XHTML documents. To specify a language for XHTML documents, use the xml:lang attribute in the html root element. HTML documents use the lang attribute for the same purpose. To ensure backward compatibility, the convention is simply to use both attributes, as shown in this example, which specifies the language of the document as French.

     <html xml:lang="fr" lang="fr" xmlns="http://www.w3.org/1999/xhtml" >

Users can set language preferences in their browsers. The browser passes this preference to the server in the Accept-Language HTTP request header each time the user requests a document. The server may use it to return the document in the preferred language, if a version matching that language is available.

The language attributes may be used in a particular element to override the language declaration for the document. In this example, a long quotation is provided in Norwegian.

     <blockquote xml:lang="no" lang="no">...</blockquote>
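To see how the document-level declaration and element-level overrides fit together, here is a minimal sketch of a complete XHTML 1.0 document. The sample text, the title, and the use of a span for the inline override are illustrative assumptions, not taken from the examples above.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr" lang="fr">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>Exemple</title>
</head>
<body>
  <!-- Inherits the document language (French) from the html element -->
  <p>Bonjour tout le monde.</p>

  <!-- Overrides the document language for this block only -->
  <blockquote xml:lang="no" lang="no">
    <p>God morgen.</p>
  </blockquote>

  <!-- Overrides the language for a short inline phrase -->
  <p>Comme on dit en anglais,
    <span xml:lang="en" lang="en">good morning</span>.</p>
</body>
</html>
```

Any element without its own language attributes inherits the language of its nearest ancestor that declares one, so only the exceptions need to be marked.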

6.3.2. Language Values

The value of the lang and xml:lang attributes is a language tag as defined in "Tags for the Identification of Languages" (RFC 3066). A language tag consists of a primary subtag that identifies the language with a two- or three-letter code from the ISO 639 standard, for example, fr for French or no for Norwegian. When a language has both a two- and a three-letter code, the two-letter code should be used.

The complete list of ISO 639 language codes is available at the Library of Congress web site at www.loc.gov/standards/iso639-2/langcodes.html. The more common two-letter codes are provided in Table 6-2 at the end of this section.

A language tag may also contain an optional subtag that further qualifies the language by country, dialect, or script, as shown in these examples.

en-GB: English as spoken in Great Britain
en-scouse: English with a scouse (Liverpool) dialect
bs-Cyrl: Bosnian with Cyrillic script (rather than Latin script, bs-Latn)
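In markup, a tag with a subtag is used exactly like a bare language code. The following sketch applies two of the tags above plus en-US (an assumed additional tag for English as used in the United States); the sample sentences are illustrative only.

```html
<!-- Primary subtag "en" qualified by a country code -->
<p xml:lang="en-GB" lang="en-GB">The colour of the lorry</p>
<p xml:lang="en-US" lang="en-US">The color of the truck</p>

<!-- Primary subtag "bs" qualified by a script code -->
<p xml:lang="bs-Cyrl" lang="bs-Cyrl">...</p>
```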

Codes for country names are provided by the standard ISO 3166 and are available at www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html. Dialect and script language tags are registered with the IANA (Internet Assigned Numbers Authority) and are available at www.iana.org/assignments/language-tags.

Table 6-2. Two-letter codes of language names
Language	Code	Language	Code	Language	Code
Afar	`aa`	Armenian	`hy`	Oriya	`or`
Abkhazian	`ab`	Herero	`hz`	Ossetian	`os`
Avestan	`ae`	Interlingua	`ia`	Punjabi	`pa`
Afrikaans	`af`	Indonesian (formerly in)	`id`	Pali	`pi`
Akan	`ak`	Interlingue	`ie`	Polish	`pl`
Amharic	`am`	Igbo	`ig`	Pashto, Pushto	`ps`
Aragonese	`an`	Sichuan Yi	`ii`	Portuguese	`pt`
Arabic	`ar`	Inupiak	`ik`	Quechua	`qu`
Assamese	`as`	Icelandic	`is`	Rhaeto-Romance	`rm`
Avaric	`av`	Italian	`it`	Kirundi	`rn`
Aymara	`ay`	Inuktitut	`iu`	Romanian	`ro`
Azerbaijani	`az`	Japanese	`ja`	Russian	`ru`
Bashkir	`ba`	Javanese	`jv`	Kinyarwanda	`rw`
Belarusian	`be`	Javanese	`jw`	Sanskrit	`sa`
Bulgarian	`bg`	Georgian	`ka`	Sardinian	`sc`
Bihari	`bh`	Kongo	`kg`	Sindhi	`sd`
Bislama	`bi`	Kikuyu	`ki`	Northern Sami	`se`
Bambara	`bm`	Kuanyama	`kj`	Sangho	`sg`
Bengali; Bangla	`bn`	Kazakh	`kk`	Serbo-Croatian	`sh`
Tibetan	`bo`	Greenlandic	`kl`	Sinhalese	`si`
Breton	`br`	Cambodian	`km`	Slovak	`sk`
Bosnian	`bs`	Kannada	`kn`	Slovenian	`sl`
Catalan	`ca`	Korean	`ko`	Samoan	`sm`
Chechen	`ce`	Kanuri	`kr`	Shona	`sn`
Chamorro	`ch`	Kashmiri	`ks`	Somali	`so`
Corsican	`co`	Kurdish	`ku`	Albanian	`sq`
Cree	`cr`	Komi	`kv`	Serbian	`sr`
Czech	`cs`	Cornish	`kw`	Swati	`ss`
Old Slavic	`cu`	Kirghiz	`ky`	Sesotho	`st`
Chuvash	`cv`	Latin	`la`	Sundanese	`su`
Welsh	`cy`	Luxembourgish	`lb`	Swedish	`sv`
Danish	`da`	Ganda	`lg`	Swahili	`sw`
German	`de`	Limburgan	`li`	Tamil	`ta`
Divehi	`dv`	Lingala	`lm`	Telugu	`te`
Dzongkha	`dz`	Lingala	`ln`	Tajik	`tg`
Ewe	`ee`	Laothian	`lo`	Thai	`th`
Greek	`el`	Lithuanian	`lt`	Tigrinya	`ti`
English	`en`	Luba Katanga	`lu`	Turkmen	`tk`
Esperanto	`eo`	Latvian	`lv`	Tagalog	`tl`
Spanish	`es`	Malagasy	`mg`	Setswana	`tn`
Estonian	`et`	Marshallese	`mh`	Tonga	`to`
Basque	`eu`	Maori	`mi`	Turkish	`tr`
Persian	`fa`	Macedonian	`mk`	Tsonga	`ts`
Fulah	`ff`	Malayalam	`ml`	Tatar	`tt`
Finnish	`fi`	Mongolian	`mn`	Twi	`tw`
Fiji	`fj`	Moldavian	`mo`	Tahitian	`ty`
Faroese	`fo`	Marathi	`mr`	Uighur	`ug`
French	`fr`	Malay	`ms`	Ukrainian	`uk`
Frisian	`fy`	Maltese	`mt`	Urdu	`ur`
Irish	`ga`	Burmese	`my`	Uzbek	`uz`
Scots Gaelic	`gd`	Nauru	`na`	Venda	`ve`
Galician	`gl`	Nepali	`ne`	Vietnamese	`vi`
Guarani	`gn`	Ndonga	`ng`	Volapük	`vo`
Gujarati	`gu`	Dutch	`nl`	Walloon	`wa`
Manx	`gv`	Nynorsk	`nn`	Wolof	`wo`
Hausa	`ha`	Norwegian	`no`	Xhosa	`xh`
Hebrew (formerly iw)	`he`	Ndebele	`nr`	Yiddish (formerly ji)	`yi`
Hindi	`hi`	Navaho	`nv`	Yoruba	`yo`
Hiri Motu	`ho`	Chichewa	`ny`	Zhuang	`za`
Croatian	`hr`	Occitan	`oc`	Chinese	`zh`
Haitian	`ht`	Ojibwa	`oj`	Zulu	`zu`
Hungarian	`hu`	(Afan) Oromo	`om`

6.3.3. Directionality

HTML 4.01 and XHTML take into account that many languages read from right to left and provide attributes for handling the directionality of text. Each character also has an inherent directionality, which is defined as one of its properties in Unicode.

The dir attribute is used for specifying the direction in which the text should be interpreted. It can be used in conjunction with the lang attribute and may be added within the tags of most elements. The accepted value for direction is either ltr for "left to right" or rtl for "right to left." For example, the following code indicates that the paragraph is intended to be displayed in Arabic, reading from right to left:

     <p lang="ar" xml:lang="ar" dir="rtl">...</p>

The bdo element, introduced in HTML 4.01, deals specifically with documents that contain combinations of left- and right-reading text (bidirectional text, or bidi, for short). The name stands for "bidirectional override"; in other words, the element marks a span of text whose intrinsic direction (as inherited from Unicode) should be overridden. The bdo element uses the dir attribute as follows:

 <bdo dir="ltr"> English phrase in an otherwise Hebrew text </bdo>...
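The effect of an override is easiest to see on text whose intrinsic direction is left to right. In this small sketch (the sample string is an assumption chosen for illustration), the second paragraph forces right-to-left order on Latin letters and digits, so a conforming browser should lay the characters out in reverse visual order.

```html
<!-- No override: Latin letters and digits follow their intrinsic
     left-to-right direction -->
<p>ABC 123</p>

<!-- Override: every character in the span is laid out right to left,
     regardless of its intrinsic Unicode direction -->
<p><bdo dir="rtl">ABC 123</bdo></p>
```

Because bdo ignores the intrinsic direction of the characters it contains, it is best reserved for the rare cases where the default bidirectional handling produces the wrong order; for ordinary right-to-left passages, the dir attribute on the containing element is usually enough.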

6.3.4. Cursive Joining Behavior

In some writing systems, the shape of a character varies depending on its position in the word. For instance, in Arabic, a character used at the beginning of a word can look completely different when it is used as the last character of a word. Generally, this joining behavior is handled by the software, but two Unicode characters give precise control over it. They have zero width and are placed between characters purely to specify whether the neighboring characters should join.

HTML 4.01 provides mnemonic character entities for both these characters, as shown in Table 6-3.

Table 6-3. Unicode characters for joining behavior
Entity	Numeric	Name	Description
`&zwnj;`	`&#8204;`	zero-width non-joiner	Prevents joining of characters that would otherwise be joined.
`&zwj;`	`&#8205;`	zero-width joiner	Joins characters that would otherwise not be joined.
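As a rough illustration of how these entities appear in markup, the sketch below uses a Persian word that is conventionally written with a non-joiner, and a single Arabic letter followed by a joiner. Both samples are assumptions chosen for illustration, and the exact rendering depends on the browser and font.

```html
<!-- The non-joiner keeps the prefix from connecting to the letter
     that follows it, without inserting a visible space -->
<p xml:lang="fa" lang="fa" dir="rtl">می&zwnj;خواهم</p>

<!-- The joiner asks for the connecting form of a letter even though
     no letter follows it -->
<p xml:lang="ar" lang="ar" dir="rtl">ه&zwj;</p>
```

The numeric references &#8204; and &#8205; from Table 6-3 can be substituted for the named entities where named entities are not available.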
