6.3. Language Features

Coordinating character sets is only the first part of the challenge. Even languages that share a character set may have different rules for hyphenation, spacing, quotation marks, punctuation, and so on. In addition to character shapes (glyphs), issues such as directionality (whether the text reads left to right or right to left) and cursive joining behavior have to be taken into account. This section introduces the features in HTML 4.01 and XHTML 1.0 (and later) that address the needs of a multilingual Web.

6.3.1. Language Specification

Authors are strongly urged to specify the language of every HTML and XHTML document. In XHTML documents, use the xml:lang attribute in the html root element; HTML documents use the lang attribute for the same purpose. To ensure backward compatibility, the convention is to use both attributes with the same value, as shown in this example, which specifies the language of the document as French.

     <html xml:lang="fr" lang="fr" xmlns="http://www.w3.org/1999/xhtml">

Users can set language preferences in their browsers. The browser passes these preferences to the server (in the HTTP Accept-Language request header) whenever the user requests a document. The server may use them to return the document in the preferred language, if a version matching the language description is available.


The language attributes may be used in a particular element to override the language declaration for the document. In this example, a long quotation is provided in Norwegian.

     <blockquote xml:lang="no" lang="no">...</blockquote> 
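
The same override works for inline content. As a minimal sketch, the fragment below marks a single phrase as French inside a paragraph that otherwise inherits the document language. The English root declaration and the sample sentence are assumptions made for illustration.

```html
<!-- Assumes the html root element declares xml:lang="en" lang="en". -->
<p>
  The waiter wished us
  <!-- Only this span is French; the rest of the paragraph inherits English. -->
  <span xml:lang="fr" lang="fr">bon appétit</span>
  and left the menus on the table.
</p>
```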

6.3.2. Language Values

The value of the lang and xml:lang attributes is a language tag as defined in "Tags for the Identification of Languages" (RFC 3066). A language tag begins with a primary subtag that identifies the language by its two- or three-letter code from the ISO 639 standard, for example, fr for French or no for Norwegian. When a language has both a two- and a three-letter code, the two-letter code should be used.

The complete list of ISO 639 language codes is available at the Library of Congress web site at www.loc.gov/standards/iso639-2/langcodes.html. The more common two-letter codes are provided in Table 6-2 at the end of this section.

A language tag may also contain an optional subtag that further qualifies the language by country, dialect, or script, as shown in these examples:

en-GB        English as spoken in Great Britain
en-scouse    English with a Scouse (Liverpool) dialect
bs-Cyrl      Bosnian with Cyrillic script (rather than Latin script, bs-Latn)

Codes for country names are provided by the standard ISO 3166 and are available at www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html. Dialect and script language tags are registered with the IANA (Internet Assigned Numbers Authority) and are available at www.iana.org/assignments/language-tags.
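
In markup, a tag with a subtag is used exactly like a bare language code. The following is a sketch only: the English sentence is an invented sample, and the Bosnian content is left elided.

```html
<!-- British English: primary subtag "en" plus the ISO 3166 country code "GB". -->
<p xml:lang="en-GB" lang="en-GB">The lorry was parked by the kerb.</p>

<!-- Bosnian in Cyrillic script: primary subtag "bs" plus the script subtag "Cyrl". -->
<p xml:lang="bs-Cyrl" lang="bs-Cyrl">...</p>
```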

Table 6-2. Two-letter codes of language names

Language              Code  Language                  Code  Language               Code
Afar                  aa    Armenian                  hy    Oriya                  or
Abkhazian             ab    Herero                    hz    Ossetian               os
Avestan               ae    Interlingua               ia    Punjabi                pa
Afrikaans             af    Indonesian (formerly in)  id    Pali                   pi
Akan                  ak    Interlingue               ie    Polish                 pl
Amharic               am    Igbo                      ig    Pashto, Pushto         ps
Aragonese             an    Sichuan Yi                ii    Portuguese             pt
Arabic                ar    Inupiak                   ik    Quechua                qu
Assamese              as    Icelandic                 is    Rhaeto-Romance         rm
Avaric                av    Italian                   it    Kirundi                rn
Aymara                ay    Inuktitut                 iu    Romanian               ro
Azerbaijani           az    Japanese                  ja    Russian                ru
Bashkir               ba    Javanese                  jv    Kinyarwanda            rw
Belarusian            be    Javanese                  jw    Sanskrit               sa
Bulgarian             bg    Georgian                  ka    Sardinian              sc
Bihari                bh    Kongo                     kg    Sindhi                 sd
Bislama               bi    Kikuyu                    ki    Northern Sami          se
Bambara               bm    Kuanyama                  kj    Sangho                 sg
Bengali; Bangla       bn    Kazakh                    kk    Serbo-Croatian         sh
Tibetan               bo    Greenlandic               kl    Sinhalese              si
Breton                br    Cambodian                 km    Slovak                 sk
Bosnian               bs    Kannada                   kn    Slovenian              sl
Catalan               ca    Korean                    ko    Samoan                 sm
Chechen               ce    Kanuri                    kr    Shona                  sn
Chamorro              ch    Kashmiri                  ks    Somali                 so
Corsican              co    Kurdish                   ku    Albanian               sq
Cree                  cr    Komi                      kv    Serbian                sr
Czech                 cs    Cornish                   kw    Swati                  ss
Old Slavic            cu    Kirghiz                   ky    Sesotho                st
Chuvash               cv    Latin                     la    Sundanese              su
Welsh                 cy    Luxembourgish             lb    Swedish                sv
Danish                da    Ganda                     lg    Swahili                sw
German                de    Limburgan                 li    Tamil                  ta
Divehi                dv    Lingala                   ln    Telugu                 te
Dzongkha              dz    Laothian                  lo    Tajik                  tg
Ewe                   ee    Lithuanian                lt    Thai                   th
Greek                 el    Luba-Katanga              lu    Tigrinya               ti
English               en    Latvian                   lv    Turkmen                tk
Esperanto             eo    Malagasy                  mg    Tagalog                tl
Spanish               es    Marshallese               mh    Setswana               tn
Estonian              et    Maori                     mi    Tonga                  to
Basque                eu    Macedonian                mk    Turkish                tr
Persian               fa    Malayalam                 ml    Tsonga                 ts
Fulah                 ff    Mongolian                 mn    Tatar                  tt
Finnish               fi    Moldavian                 mo    Twi                    tw
Fiji                  fj    Marathi                   mr    Tahitian               ty
Faroese               fo    Malay                     ms    Uighur                 ug
French                fr    Maltese                   mt    Ukrainian              uk
Frisian               fy    Burmese                   my    Urdu                   ur
Irish                 ga    Nauru                     na    Uzbek                  uz
Scots Gaelic          gd    Nepali                    ne    Venda                  ve
Galician              gl    Ndonga                    ng    Vietnamese             vi
Guarani               gn    Dutch                     nl    Volapük                vo
Gujarati              gu    Nynorsk                   nn    Walloon                wa
Manx                  gv    Norwegian                 no    Wolof                  wo
Hausa                 ha    Ndebele                   nr    Xhosa                  xh
Hebrew (formerly iw)  he    Navaho                    nv    Yiddish (formerly ji)  yi
Hindi                 hi    Chichewa                  ny    Yoruba                 yo
Hiri Motu             ho    Occitan                   oc    Zhuang                 za
Croatian              hr    Ojibwa                    oj    Chinese                zh
Haitian               ht    (Afan) Oromo              om    Zulu                   zu
Hungarian             hu


6.3.3. Directionality

HTML 4.01 and XHTML take into account that many languages read from right to left and provide attributes for handling the directionality of text. In Unicode, each character carries an inherent directionality as one of its properties.

The dir attribute is used for specifying the direction in which the text should be interpreted. It can be used in conjunction with the lang attribute and may be added within the tags of most elements. The accepted value for direction is either ltr for "left to right" or rtl for "right to left." For example, the following code indicates that the paragraph is intended to be displayed in Arabic, reading from right to left:

     <p lang="ar" xml:lang="ar" dir="rtl">...</p> 

The bdo element, introduced in HTML 4, deals specifically with documents that contain a combination of left- and right-reading text (bidirectional text, or bidi for short). The name stands for "bidirectional override": the element marks a span of text whose intrinsic direction (as inherited from Unicode) should be overridden. The bdo element uses the dir attribute as follows:

     <bdo dir="ltr"> English phrase in an otherwise Hebrew text </bdo>...
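
To make the override visible, here is a small sketch that forces Latin letters and digits, which normally run left to right, into right-to-left order. The sample text is an assumption chosen for illustration. In a browser that supports bdo, the second paragraph should appear as "321 CBA".

```html
<!-- Without an override, the text follows its intrinsic left-to-right direction. -->
<p>ABC 123</p>

<!-- bdo with dir="rtl" overrides the intrinsic direction of every character,
     so the characters are laid out from right to left. -->
<p><bdo dir="rtl">ABC 123</bdo></p>
```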

6.3.4. Cursive Joining Behavior

In some writing systems, the shape of a character varies depending on its position in the word. For instance, in Arabic, a character at the beginning of a word can look completely different from the same character at the end of a word. Generally, this joining behavior is handled by the software, but Unicode provides two characters that give precise control over it. They have zero width and are placed between characters purely to specify whether the neighboring characters should join.

HTML 4.01 provides mnemonic character entities for both these characters, as shown in Table 6-3.

Table 6-3. Unicode characters for joining behavior

Entity    Numeric    Name                     Description
&zwnj;    &#8204;    zero-width non-joiner    Prevents joining of characters that would otherwise be joined.
&zwj;     &#8205;    zero-width joiner        Joins characters that would otherwise not be joined.
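
As a hedged illustration of these entities in markup, the sketch below uses a Persian word that is conventionally written with a non-joiner after its prefix, followed by a single Arabic letter forced into a joining form. It assumes the document is served in an encoding such as UTF-8 so that the Persian letters can be typed directly; the sample words are chosen for illustration, and the exact shapes depend on the browser and font.

```html
<!-- Persian: the zero-width non-joiner keeps the prefix from joining
     the letter that follows it, without inserting a visible space. -->
<p xml:lang="fa" lang="fa" dir="rtl">می&zwnj;خواهم</p>

<!-- The same word written with the numeric character reference. -->
<p xml:lang="fa" lang="fa" dir="rtl">می&#8204;خواهم</p>

<!-- Arabic letter heh (&#1607;) followed by the zero-width joiner:
     the letter is typically drawn in its initial (joining) form
     even though no visible letter follows it. -->
<p xml:lang="ar" lang="ar" dir="rtl">&#1607;&zwj;</p>
```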





Web Design in a Nutshell
Web Design in a Nutshell: A Desktop Quick Reference (In a Nutshell (OReilly))
ISBN: 0596009879
Year: 2006
Pages: 325
