International Features | Developing International Software

This section probes deeper into HTML's international capabilities, all of which will assist you in creating world-ready software. As stated, a basic international element was present in the original versions of HTML, but HTML 4 added numerous international features. The features that help in dealing with different character sets and encodings include the HTML charset declaration (see Chapter 3), character entity references, numeric entity references (see Chapter 3), form submission, URL encoding, cascading style sheets (CSS), and the LANG attribute.

Character Entity References

Character entity references provide symbolic names (or entities) for a set of characters and are defined by the HTML standard. A character entity reference always starts with an ampersand (&), followed by the symbolic name and a semicolon. The four entity references that you should be familiar with are:

&lt; for the character "<" (less than)
&gt; for the character ">" (greater than)
&amp; for the character "&" (ampersand)
&nbsp; for a nonbreaking space character

The first three entity references (&lt;, &gt;, and &amp;) are essential because the characters they represent cannot otherwise be rendered as text in an HTML document. HTML 4 defines 248 additional entity references, covering the Greek and Latin 1 alphabets as well as additional punctuation characters. A standards-compliant browser such as Internet Explorer 5 or later will render any of these references. However, for better compatibility, it is recommended that you use the character itself within the document, rather than the entity. There is no reason to use any entity references other than the four listed previously. The additional 248 entity references exist merely to allow human editing of the HTML source, such as when someone has a keyboard that does not contain certain characters.

Other examples of entity references include the following:

&euro; for the euro sign "€"
&auml; for the character "ä"
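
As a rough illustration of how entity references map to characters, the following Python sketch uses the standard library's html module. It is not part of the original text; the sample strings are made up for the example, and the module's entity tables are assumed to match the HTML 4 names discussed above.

```python
import html
from html.entities import name2codepoint

# Escape the three markup-significant characters before placing text in HTML.
escaped = html.escape("if a < b && b > c", quote=False)
print(escaped)

# Resolve named references back to the characters they stand for.
decoded = html.unescape("&lt;p&gt; &amp; 5&nbsp;&euro;")
print(decoded)

# Look up the code point behind a symbolic name defined by HTML 4.
euro = chr(name2codepoint["euro"])
a_umlaut = chr(name2codepoint["auml"])
print(euro, a_umlaut)
```

Only the escaping step is needed for correctness. The named references for the euro sign and "ä" can be replaced by the characters themselves when the document encoding can represent them.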

Form Submission

HTML offers a method to submit text input from the browser to a Web server via the FORM element. The FORM element can send the text input by using the HTTP methods GET or POST. With either method, the encoding of the submitted text data follows the encoding of the document in which the FORM element is embedded. For example, if the document encoding is ISO 8859-1, the character "Ä" would be submitted as 0xC4.

The HTML standard leaves undefined what would happen if a user entered a character outside ISO 8859-1 in the example just given. Internet Explorer will submit characters that have been entered into a FORM element, and that are outside the document's encoding, as decimal numeric character references. For example, if the document encoding is ISO 8859-1, the Greek capital letter sigma ("Σ") would be submitted as "&#931;".

A good method to avoid ambiguity in form submissions is to always encode HTML documents in UTF-8 or UTF-16. This way, the receiving application or service does not have to deal with translating numeric character references back to actual characters, but rather can process all possible returns the very same way.
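
The ambiguity can be seen in a small Python sketch. It only imitates the browser behavior described above by using the xmlcharrefreplace error handler of str.encode; the function name encode_form_value is hypothetical, and real browsers additionally apply URL escaping to the submitted bytes.

```python
def encode_form_value(text: str, document_encoding: str) -> bytes:
    """Encode form text; characters outside the encoding become decimal references."""
    return text.encode(document_encoding, errors="xmlcharrefreplace")

# "Ä" exists in ISO 8859-1 and is submitted as the single byte 0xC4.
a_umlaut = encode_form_value("Ä", "iso-8859-1")

# The Greek capital sigma does not exist in ISO 8859-1 and becomes a reference.
sigma = encode_form_value("Σ", "iso-8859-1")

# A user who literally types the seven characters "&#931;" produces the same bytes,
# so the server cannot tell the two inputs apart.
literal = encode_form_value("&#931;", "iso-8859-1")

# With a UTF-8 document the sigma is submitted directly and the ambiguity disappears.
sigma_utf8 = encode_form_value("Σ", "utf-8")

print(a_umlaut, sigma, literal, sigma_utf8)
```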

The FORM element in HTML 4 allows you to specify an attribute called "Accept-Charset." The function of this attribute is to restrict the list of encodings acceptable to the server. Internet Explorer 5 and later will respond only to a value of UTF-8. If the browser finds

 <form accept-charset="utf-8">

Internet Explorer will submit the form data in UTF-8, regardless of the document's character encoding. With this method, it is possible to post the pages in any legacy encoding for maximum compatibility, while the server-side script processes UTF-8 for data submitted by browsers that respond to Accept-Charset. If a document is encoded in UTF-16 or UTF-32, the data will likewise be submitted in UTF-8.

Table 14-1 summarizes how data of various document encodings will be submitted.

Table 14-1 Encoding used for submitting data.

Document Encoding	Accept-Charset Value	Encoding of Submitted Data
UTF-8	No relevance	UTF-8
UTF-16	No relevance	UTF-8
UTF-32	No relevance	UTF-8
Any other charset	UTF-8	UTF-8
Any other charset	Not specified	Same as document encoding
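
The rules in Table 14-1 can be written down as a small lookup function. The Python sketch below is only a restatement of the table; the function name submission_encoding and its parameters are hypothetical and do not correspond to any browser API.

```python
from typing import Optional

# Document encodings for which Table 14-1 lists UTF-8 regardless of Accept-Charset.
UNICODE_ENCODINGS = {"utf-8", "utf-16", "utf-32"}

def submission_encoding(document_encoding: str,
                        accept_charset: Optional[str] = None) -> str:
    """Return the encoding of submitted form data according to Table 14-1."""
    doc = document_encoding.strip().lower()
    if doc in UNICODE_ENCODINGS:
        return "utf-8"
    if accept_charset is not None and accept_charset.strip().lower() == "utf-8":
        return "utf-8"
    # No Accept-Charset value: the data follows the document encoding.
    return doc

print(submission_encoding("UTF-16"))
print(submission_encoding("windows-1256", "utf-8"))
print(submission_encoding("ISO-8859-1"))
```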

URL Encoding

RFC 1738 is the reference containing standards for URLs. It declares that only 7-bit characters can appear in URLs. This declaration allows the US-ASCII character set to appear inside a URL, with the exception of characters that have a reserved meaning or are considered unsafe, such as the colon (:), the ampersand (&), the pound sign (#), the question mark (?), and the space character. RFC 1738 allows characters outside the US-ASCII character set to be expressed using a percent character (%) followed by two hexadecimal digits denoting an 8-bit character, in a mechanism known as "%hh escaping." For example, "%D6" represents the Latin capital letter "O" with diaeresis in the ISO 8859-1 character encoding. RFC 1738 does not specify which character encoding should be applied to these 8-bit characters. In the final analysis, the browser and server must agree on the character encoding to use for the elements of the URL.
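
To make the dependence on the character encoding concrete, here is a short Python sketch using urllib.parse from the standard library. It is an illustration only; the variable names are made up for the example.

```python
from urllib.parse import quote, unquote

# The same character escapes differently depending on the assumed encoding.
latin1_escaped = quote("Ö", encoding="iso-8859-1")
utf8_escaped = quote("Ö", encoding="utf-8")
print(latin1_escaped, utf8_escaped)

# Decoding succeeds only if both sides agree on the encoding.
agreed = unquote("%D6", encoding="iso-8859-1")
# By default, undecodable bytes are replaced with U+FFFD rather than raising.
mismatched = unquote("%D6", encoding="utf-8")
print(agreed, mismatched)
```

The mismatched case shows why the browser and server must agree: the byte 0xD6 on its own is not valid UTF-8, so the original character is lost.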

Figure 14-1 illustrates the elements of a URL, shown using the HTTPS protocol, and the remainder of this section explains these elements one by one.

Figure 14-1 The elements of a URL.

The element called "pr" is the protocol identifier. The protocol defines the semantics of the remaining elements of the URL. In the previous figure, the secure version of HTTP is specified. All names of protocols are composed of a subset of the US-ASCII character set. No other characters are allowed.

The element called "host" gives the domain name system (DNS) name of the computer to which this request is sent. The DNS is standardized to use a subset of US-ASCII for domain names, although effort is being made to extend the DNS so that it handles characters outside the US-ASCII character set. These efforts have not yet yielded a worldwide-accepted implementation standard. Most current proposals are modeled after Row-based ASCII-Compatible Encoding (RACE), which encodes a string of Unicode code points as a relatively short string of 7-bit characters. Typically, a unique signature (for example, "bq--") precedes a RACE-encoded string.

The element called "path" identifies an object on the server given by the host name. In this example, the addressed object is an ASP page ("disv1.asp") within a certain folder in the host's file system. If escaped using the %hh notation, 8-bit characters are allowed and not uncommon for the path. There is no general standard stating the character encoding to use for the path. In practice, UTF-8 has been deployed; it generally has the best chance of succeeding in finding the desired object. Both Internet Explorer and Microsoft Internet Information Services (IIS) have standardized on using UTF-8 for the path.

The element called "query" passes one or more parameters to the object given in the path. Again, %hh-escaped 8-bit characters are allowed. The character encoding of the query is independent of any other elements of the URL, and it is up to the browser and the author of the server script to agree upon this encoding. It is recommended that the browser tell the server script what the character encoding of the query is, such as by appending an extra field to the query (for example, "_charset_=utf-8").
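
The following Python sketch shows one way a server-side script might take a URL apart and honor a "_charset_" field. The URL, the folder name, and the field name "name" are invented for the example (only "disv1.asp" and "_charset_" come from the text), and falling back to UTF-8 when the field is missing is an assumption.

```python
from urllib.parse import urlsplit, unquote, parse_qs

# Hypothetical URL: UTF-8 in the path, ISO 8859-1 in the query.
url = ("https://www.example.com/%C3%9Cbersicht/disv1.asp"
       "?name=%D6l&_charset_=iso-8859-1")

parts = urlsplit(url)
print(parts.scheme, parts.netloc, parts.path, parts.query)

# The path is decoded as UTF-8, the convention described for the path element.
path = unquote(parts.path, encoding="utf-8")

# First pass: read the ASCII-only _charset_ field, ignoring undecodable bytes.
raw_fields = parse_qs(parts.query, encoding="ascii", errors="replace")
charset = raw_fields.get("_charset_", ["utf-8"])[0]

# Second pass: decode the query with the encoding the sender announced.
fields = parse_qs(parts.query, encoding=charset)
print(path, charset, fields["name"])
```

A production script would also validate the announced charset before using it, which this sketch omits.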

Another useful feature in conjunction with HTML involves cascading style sheets. This feature lets you specify the desired formatting for an HTML document.

Cascading Style Sheets

CSS is designed to separate instructions for the visual representation of data from the data itself. A style sheet can be embedded inside an HTML document, or can be referenced as an external document that the browser fetches from the server. Of international relevance are three elements of style sheets: the encoding of the style sheet document itself, and the referencing of fonts by both their names and their sizes. Style sheets enable Web site authors to determine and modify the visual appearance of a large set of documents on a Web server (or even across a range of Web servers) and across a multitude of languages to achieve a common corporate identity.

If the style sheet is shared between languages using the Latin script and languages using ideographic scripts like Chinese or Japanese, the author of the style sheet needs to allow the localizer or translator to adjust the font names to the language. This adjustment is necessary because typically the fonts for Latin-based languages do not contain ideographs. The reverse is not true: ideographic fonts do contain Latin glyphs, but these glyphs often do not meet the visual appearance standards required by Western readers. In addition, the ideographic fonts that shipped with the Chinese version of Microsoft Windows 95 and Microsoft Windows 98 use a native (Chinese) name, whereas all other language versions of these two operating systems refer to the same font by a name in Latin characters. This is also true for the fonts that shipped in the Japanese and Korean versions of Windows 95 and Windows 98. It is only since Microsoft Windows 2000 that fonts can be specified either natively or by using Latin characters, regardless of the language version of Windows 2000.

The author of the shared style sheet should also allow the localizer or translator to adjust the font size. Languages using ideographic scripts will need to increase the font size by 1 or 1.5 points to achieve the same look and feel as the Latin-based language versions of documents that use the same style sheet. Increasing the font size will also maintain readability.

A style sheet can either be embedded into the HTML document or be referenced as an external document. If embedded into the HTML document, the character encoding will be the same as that of the parent document. For an external style sheet, CSS defines the @charset rule. The @charset rule takes a string as an argument, specifying the character encoding to use. This rule needs to be placed at the very beginning of the document and can appear only once.

Here is an example:

 @charset "ISO-8859-1";

This string tells the browser to interpret characters within the cascading style sheet as ISO 8859-1. (For more information on CSS, see "Setting Direction with CSS" later in this chapter, as well as the HTML AutoLayout [HAL] document in the Misc subdirectory on the companion CD.)
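
As a sketch of how a program might honor the @charset rule when reading an external style sheet, the Python function below looks for the rule at the very beginning of the file. The function name, the sample font name, and the UTF-8 fallback are assumptions made for the example, and the approach works only for ASCII-compatible encodings.

```python
import re

# The rule must be the very first thing in the style sheet.
_CHARSET_RULE = re.compile(rb'^@charset "([A-Za-z0-9._:\-]+)";')

def stylesheet_encoding(css_bytes: bytes, default: str = "utf-8") -> str:
    """Return the encoding named by a leading @charset rule, else the default."""
    match = _CHARSET_RULE.match(css_bytes)
    return match.group(1).decode("ascii") if match else default

# Hypothetical style sheet whose font name contains a non-ASCII character.
sheet = b'@charset "ISO-8859-1";\nP { font-family: "M\xfcller Sans"; }'
encoding = stylesheet_encoding(sheet)
text = sheet.decode(encoding)
print(encoding)
print(text)
```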

The LANG Attribute

Another internationalization feature of HTML is the LANG attribute, which was defined by HTML 4 in order to indicate the language of an element. LANG refers to a "natural" language that humans use in order to speak or write; in other words, the LANG attribute excludes computer languages. (In contrast, the LANGUAGE attribute of the SCRIPT element refers exclusively to computer languages.) The allowed values for the LANG attribute follow RFC 3066 (which updated RFC 1766). A LANG attribute typically consists of a two-letter ISO 639 language identifier (or primary code), such as "en" for English, "de" for German, "ja" for Japanese, or "hi" for Hindi. The language identifier can be (but does not have to be) followed by a two-letter country code (or subcode) in accordance with the ISO 3166 set of country/region identifiers. Formally defined, the "LANG value" is a language code consisting of a primary code followed by a possibly empty series of subcodes, each preceded by a hyphen (-).

For languages that do not have a language identifier or country code listed in ISO 639 or ISO 3166, respectively, the Internet Assigned Numbers Authority (IANA) can define additional languages (as it has recently done for sign languages). These particular languages have a language identifier that consists of more than two letters. Language codes starting with "x-" are user-defined and are not standardized.
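
The structure of a LANG value can be illustrated with a few lines of Python. This is a minimal sketch that only splits the value on hyphens; the LangValue type and parse_lang function are hypothetical names, and no validation against ISO 639, ISO 3166, or the IANA registry is attempted.

```python
from typing import NamedTuple, Tuple

class LangValue(NamedTuple):
    primary: str                # primary code, such as "en" or "x"
    subcodes: Tuple[str, ...]   # zero or more subcodes, such as ("US",)
    user_defined: bool          # True for values starting with "x-"

def parse_lang(value: str) -> LangValue:
    """Split a LANG value into its primary code and subcodes."""
    parts = value.strip().split("-")
    primary = parts[0].lower()
    return LangValue(primary, tuple(parts[1:]), primary == "x")

print(parse_lang("de"))
print(parse_lang("en-US"))
print(parse_lang("x-klingon"))
```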

The LANG attribute also helps with lexical analysis of the text, such as when checking spelling or when using speech synthesis. LANG is not used to specify the directional layout properties of the HTML document. Some browsers, including Internet Explorer, take the LANG attribute into account when choosing an appropriate font to render a section of text. In addition to using the LANG attribute for indicating the language of an element, there are certain language properties that are handy in determining what language a particular user prefers.

Other Language Properties

The Accept-Language HTTP header field and the properties exposed in the HTML Object Model of Internet Explorer 5 and later are ways to determine user language preferences. Accept-Language accomplishes this by assigning a quality value to each language that the user accepts. Besides determining user language preferences, the properties exposed in the HTML Object Model are useful for determining operating-system settings such as the current user locale or current system locale. (For more information on determining user language preferences and system settings, see Chapter 4, "Locale and Cultural Awareness.")
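
A server-side script could rank the languages in an Accept-Language header roughly as follows. This Python sketch is an assumption-laden illustration: the function name is hypothetical, a missing quality value is treated as 1.0, and a malformed one as 0.0.

```python
from typing import List, Tuple

def parse_accept_language(header: str) -> List[Tuple[str, float]]:
    """Return (language, quality) pairs, most preferred first."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        quality = 1.0  # no q parameter means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # assumption: ignore malformed values
        prefs.append((tag.strip(), quality))
    # sorted() is stable, so equal qualities keep their header order.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)

ranked = parse_accept_language("en;q=0.7, da, en-gb;q=0.8")
print(ranked)
```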

Bidirectional Layout

As with many other international features in HTML, version 4 of the standard introduced rules for the definition and handling of bidirectional text. While text that uses the Arabic, Hebrew, or Farsi script is written from right to left, documents employing these scripts can embed elements from other scripts, such as company names in Latin script that run from left to right. Because both directions are allowed within the same body of text, the arrangement is called "bidirectional." The LANG attribute, discussed in "The LANG Attribute" earlier in this chapter, has no influence on the directionality of text. The HTML DIR (direction) attribute is used to indicate the base directionality of an element: "RTL" to indicate right-to-left direction and "LTR" to indicate left-to-right direction. (For more information on bidirectional text, see Chapter 5, "Text Input, Output, and Display.") In addition to the DIR attribute, other mechanisms that can affect the directionality of text are the bidirectional override (BDO) element and direction setting with CSS.

The BDO Element

The BDO element is used to override the directionality of text. The DIR attribute must be used with this element to specify the direction of the override. The BDO element is useful for overriding the Unicode bidirectional algorithm, such as when you need to mix alphabetic and numeric characters (as when representing part numbers of a computer, for example).

Using <BDO dir="rtl"> will render the text "Mirror" as "rorriM." Using <BDO dir="ltr"> will have no effect on Latin text because the Unicode algorithm has already assigned left-to-right directionality to Latin text.

Setting Direction with CSS

Glossary

Strong character: A character from which text direction can be determined.
Left-to-right embedding (LRE) mark: Signals that a piece of text is to be treated as embedded left to right. For example, an English quotation in the middle of an Arabic sentence could be marked as being embedded left to right. (LRE affects word order, not character order.)
Right-to-left embedding (RLE) mark: Signals that a piece of text is to be treated as embedded right to left. For example, a Hebrew phrase in the middle of an English quotation could be marked as being embedded right to left. (RLE affects word order, not character order.)
Pop directional formatting (PDF): Terminates the effects of the last explicit code (either embedding or override) and restores the bidirectional state to what it was before the last LRE, RLE, right-to-left override (RLO), or left-to-right override (LRO) control characters.
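
The glossary terms can be explored with Python's unicodedata module, which reports the bidirectional category of each character. The sketch below is an illustration only; treating the categories L, R, and AL as "strong" follows the Unicode bidirectional algorithm, and the helper name is_strong is hypothetical.

```python
import unicodedata

# Bidirectional categories that the Unicode algorithm treats as strong.
STRONG_CATEGORIES = {"L", "R", "AL"}

def is_strong(ch: str) -> bool:
    """Return True if text direction can be determined from this character."""
    return unicodedata.bidirectional(ch) in STRONG_CATEGORIES

# A Latin letter, a Hebrew letter, an Arabic letter, a digit, and a space.
for ch in ["M", "\u05d0", "\u0628", "7", " "]:
    print(repr(ch), unicodedata.bidirectional(ch), is_strong(ch))

# The explicit formatting characters carry their own category names.
controls = {name: unicodedata.bidirectional(ch) for name, ch in [
    ("LRE", "\u202a"), ("RLE", "\u202b"), ("PDF", "\u202c"),
    ("LRO", "\u202d"), ("RLO", "\u202e"),
]}
print(controls)
```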

CSS has a feature similar to the DIR attribute. In CSS, direction is set through the use of two style properties: Direction and Unicode-bidi. The following sections, which explain both these properties, stem directly from the CSS2 specification.

The Direction Property

This property specifies the direction of base writing (in terms of blocks of text), as well as the direction of embeddings and overrides for the Unicode bidirectional algorithm. In addition, the Direction property specifies the direction of column layout in tables, the direction of horizontal overflow, and the position of the last line of text in a block (such as the last line in a paragraph).

Values for this property have the following meanings:

ltr : This signifies left-to-right direction.
rtl : This signifies right-to-left direction.
inherit : This property takes the same value as the property for the element's parent.

For the Direction property to have any effect on inline-level elements, the Unicode-bidi property's value must be embed or bidi-override.

Note

The Direction property, when specified for table column elements, is not inherited by cells in the column since columns don't exist in the document tree. Thus CSS cannot easily capture the DIR attribute inheritance rules described in HTML 4, section 11.3.2.1.

The Unicode-bidi Property

Browsers complying with the Unicode bidirectional algorithm (for example, Internet Explorer) will display characters in the correct writing direction automatically. In cases where authors must influence the rendering with explicit directional markup, the Unicode-bidi property can be used to signal that an element opens an explicit embedding or directional override.

The values for the Unicode-bidi property are:

normal : The element follows implicit BiDi rules. This means that implicit reordering works across element boundaries based on the Direction property and on strong characters in the text.
embed : The element explicitly opens an additional embedding and is equivalent to using an LRE or RLE mark in the code. The agent must terminate the explicit embedding at the end of the element (using PDF).
bidi-override : The element explicitly forces the characters to be treated as strong characters in the direction specified by the Direction property. This is equivalent to using the BDO element. The agent must terminate the explicit override at the end of the element (using PDF). This is used for special cases, such as part numbers.
inherit : The property takes the same value as the property for the element's parent.
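
The correspondence between these values and the Unicode formatting characters can be sketched in Python. The function below is a simplified model for plain text, not a description of how any browser implements the property; the name apply_unicode_bidi is hypothetical, and the inherit value is left out because it depends on the parent element.

```python
LRE, RLE = "\u202a", "\u202b"   # left-to-right / right-to-left embedding
LRO, RLO = "\u202d", "\u202e"   # left-to-right / right-to-left override
PDF = "\u202c"                  # pop directional formatting

def apply_unicode_bidi(text: str, direction: str,
                       unicode_bidi: str = "normal") -> str:
    """Wrap text in the control characters equivalent to the CSS values."""
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    if unicode_bidi == "normal":
        return text  # implicit rules only, no explicit codes
    if unicode_bidi == "embed":
        opener = LRE if direction == "ltr" else RLE
    elif unicode_bidi == "bidi-override":
        opener = LRO if direction == "ltr" else RLO
    else:
        raise ValueError("unsupported unicode-bidi value")
    # The embedding or override is terminated at the end of the element.
    return opener + text + PDF

overridden = apply_unicode_bidi("Mirror", "rtl", "bidi-override")
print(repr(overridden))
```

Displayed by a renderer that implements the Unicode bidirectional algorithm, the overridden string is expected to appear reversed, much like the BDO example with dir="rtl".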

The Direction property has an implicit effect on inline elements when applying the Unicode bidirectional algorithm. The Unicode-bidi property is only required when explicit markup is required to achieve the desired result.

The following HTML example demonstrates that the Unicode-bidi property is required to achieve proper text layout with inline elements based on the Unicode bidirectional algorithm. It also illustrates an important design principle: Web page designers should take bidirectionality into account, both in the language proper (elements and attributes) and in any accompanying style sheets. The style sheets should be designed so that bidirectional rules are separate from other style rules.

 <HTML DIR=RTL>
 <HEAD>
 <TITLE>Direction with Styles</TITLE>
 <META charset=windows-1256>
 <STYLE>
  DIV.arabic  {direction: rtl; unicode-bidi: normal;}
  DIV.english  {direction: ltr; unicode-bidi: normal;}
  SPAN.arabic1 {direction: rtl; unicode-bidi: normal;}
  SPAN.arabic2 {direction: rtl; unicode-bidi: embed;}
 </STYLE>
 </HEAD>
 <BODY>
 <DIV class=arabic>
  <P>2 english3 5
  <BR>6 <B>7 </B>8</P>
 </DIV>
 <hr>
 <DIV class=english>
  <P>english9 english10 english11 13 12
  <BR>english14 english15 english16</P>
  <hr>
  <P>The following line has: Span.arabic1
  {direction: rtl; unicode-bidi: normal;}</P>
  <P>english17
   <SPAN class=arabic1>18 english19 20</SPAN>
  </P>
  <hr>
  <P>The following line has: Span.arabic2
  {direction: rtl; unicode-bidi: embed;}</P>
  <P>english21
   <SPAN class=arabic2>22 english23 24</SPAN>
  </P>
 </DIV>
 </BODY>
 </HTML>

Notice the difference in the Unicode-bidi assignment given to SPAN.arabic1 and SPAN.arabic2 in the previous code sample. Figure 14-2 illustrates using the correct Unicode-bidi assignment to produce the appropriate layout.

Figure 14-2 Using Unicode-bidi to obtain the appropriate layout.