10.3. Content Negotiation and Multilingual Sites

In the web context, content negotiation means automatic selection between alternatives, such as different language versions or differently encoded versions of web content. The negotiation takes place between a browser and a server, without direct human interference.

In content negotiation, the browser is supposed to act on behalf of the user, sending the user's preference settings as needed. In practice, however, this is the weakest point, especially in language negotiation: most users have never checked the relevant settings of the browser. In Chapter 7, we described such features in browsers, but they are not widely known, and the user interfaces are rather inconvenient even to experienced users.

10.3.1. Introduction to Multilingual Web Sites

A web site can be multilingual in many ways. It may contain information about several languages, or information on some topic in different languages, but not the same information. Many sites contain different languages without being multilingual in this sense. It is rather typical that a site contains a short summary page, or a few summary pages, in English, but the content proper is in some other language only. In such situations, you will not encounter the problems (and possibilities) of a multilingual site. However, part of a site might be multilingual (e.g., when some essential information needs to be available in many languages).

10.3.1.1. Parallel versions in different languages

In this section, multilingualism of a site means that the same textual content is available to users in different language versions, for all or at least some of the pages. Even on a multilingual site, each page is usually in one language only, at least for most of it. This is generally recommendable. Sites can be multilingual, but languages should not be mixed within a page, as a rule.

Using just one language on one page avoids several problems with character encoding, or at least gives more options in solving them.


For example, suppose that you have the same content in French and in Russian. If you use separate pages, the French page can be, for example, ISO-8859-1 encoded and the Russian page, KOI8-R encoded. If you used a single page instead (e.g., with one column in French and another column in Russian), you could not use either of those encodings, or any 8-bit encoding, without special arrangements. (You could, for example, use character references like &#x430; (for the Cyrillic letter а) to refer to Cyrillic letters in an ISO-8859-1 encoded page, but that would be rather awkward for large amounts of text.) Using UTF-8 would let you mix French and Russian, but UTF-8 is not always a practical choice.
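The encoding constraint is easy to check experimentally. The following is a minimal sketch in Python; the sample strings and the helper name can_encode are made up for illustration and are not part of any site discussed here.

```python
# Sample text mixing French and Russian (illustrative only).
french = "café"
russian = "Привет"
mixed = french + " / " + russian


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Each 8-bit encoding covers its own language but not the mixture.
print(can_encode(french, "iso-8859-1"))   # True
print(can_encode(russian, "koi8-r"))      # True
print(can_encode(mixed, "iso-8859-1"))    # False
print(can_encode(mixed, "koi8-r"))        # False
print(can_encode(mixed, "utf-8"))         # True

# Character references as the "special arrangement": characters missing
# from ISO-8859-1 are written as decimal references such as &#1055;.
as_references = mixed.encode("iso-8859-1", errors="xmlcharrefreplace")
print(as_references.decode("iso-8859-1"))
```

The last line shows why character references are awkward for long passages: every Cyrillic letter turns into a seven-character reference.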

Thus, in most cases, separate pages in separate languages are needed. This creates a terminological problem: the word "page" could refer to some content in general, or its expression in different languages. In the sequel, we will use "page" in the abstract sense, and use expressions like "language versions of a page" when needed.

10.3.1.2. Pages with a mix of languages

Sometimes multilingualism can be implemented so that one page contains texts in different languages. This is usually practical only if there are just a few languages and the texts are short (e.g., on a page where the main content is an image or a gallery of images, accompanied by short captions in two or a few languages).

Some content is inherently multilingual. A dictionary is the most obvious example. In the humanities, it is often appropriate to quote long passages in other languages, since the readers are assumed to know them. In teaching material, critical reviews of translations, etc., it is often necessary to present texts in different languages in parallel. For such content, you should select an encoding that lets you enter text in all the languages directly. Therefore, it is often best to choose UTF-8.

More often, a page contains names or other short expressions in different languages. This includes links to versions of the page in other languages, since such links are usually best written using words in the other languages. For short texts, character references are often a feasible way to avoid problems of encoding.

10.3.1.3. Language negotiation: automatic selection of version

Multilingualism in the sense discussed here normally means that each language version of a page is in a file of its own and can be referred to using a web address (URL) of its own. But since it would be difficult to announce the address of a French version to French-speaking people, the address of a German version to German-speaking people, etc., it would be best if the same address could be used by all.

The general idea is that you would use a single address that resolves to different specific addresses automatically. Everyone would get the page in his own language, or in the language among the available alternatives that is best understood by him. This can be partly achieved using automatic language negotiation; on the user side, this only requires that the user specify his language preferences once in the settings of his browser.

The basic principle of language negotiation is simple. When requesting a web page, by specifying the URL, a browser sends a header that specifies the languages that the user understands, with weights that indicate their relative desirability. The web server may then use this information to select one of several versions in different languages, if such versions exist. The same basic mechanism can also be used to negotiate on the content type (media type), i.e., to select between plain text, HTML, and Word format when available, as well as on character encoding.

However, for several reasons, the language negotiation mechanism is not sufficient (and it is not indispensable, on the other hand). In any case, the author should write explicit links, through which the user can move, e.g., from a German version to a French version and vice versa. (In some situations, the user would even want to open them simultaneously to compare them or use them in parallel, e.g., if she does not read either language fluently but can make some use of them.)

As an example of a multilingual site (which, by the way, discusses the creation of such sites), consider the Alis Babel site. Its generic address is http://babel.alis.com/. If the browser supports language negotiation, as most browsers in use do, then using this address (e.g., by following that link) will give you a version in English, French, Italian, German, Spanish, Swedish, or Portuguese, according to which of these languages occurs first in the user's language preferences. If, for example, Swedish is the first language there, the user gets the Swedish version, which is also accessible via its specific address http://babel.alis.com/index.sv.html. (Note that the browser does not display that but the general address, if the general address was used.)

If the server has no version that matches any of the languages in the user's preferences, then the intent is that the user sees a page that describes the situation and gives a menu of available alternatives. Some browsers however fail to do that; instead they give the user some of the alternatives in a rather random fashion. Even this isn't fatal, if that alternative contains links to the other options.

10.3.1.4. Language versus country

Quite often, page authors try to perform language selection based on the user's country, typically deduced from the Internet address, more exactly, its top-level domain. This is largely just guesswork and guaranteed to fail quite often, partly because many top-level domains (.com, .org, etc.) are not limited to one country. For example, not everyone in the .fr domain (or, more properly, using a computer in the .fr domain) speaks French as her native language, or at all. Besides, French-speaking people widely use addresses other than .fr addresses, such as .be (Belgium) or .ca (Canada).

If you still try to make a language selection guess according to the user's domain, remember that the guess will quite often be wrong. Thus, it is necessary to make available links through which the user can find a page in his preferred language.

10.3.2. Links to Language Versions

Language negotiation can greatly improve the usability of a site. It is however not necessary, even if the pages exist in different language versions. Neither should one regard it as sufficient. In any case, linking to different language versions is needed.

There are strong reasons to provide links to different language versions even if the server supports language negotiation and arrangements have been made to use that. The reasons include the following:

  • Browser support for language negotiation cannot be trusted. Some browsers have no support, but more importantly, the general awareness about the issue among users is still rather limited. The browser defaults typically reflect the browser's language only. Thus, the information sent by a browser can be in serious conflict with the actual preferences of the user.

  • Problems related to caches may cause the browser to get the wrong language version.

  • Users may wish to compare the different language versions or otherwise make use of them. Perhaps someone does not understand a statement in a French version even if French is his native language, but checking the corresponding statement in an English version may help (especially in areas where English is dominant in technical terminology).

  • Some users prefer reading the original version (among some languages that they know), since they know that something is always lost in translation.

  • Users may encounter language-specific versions in different ways: by following a link, by using a search engine, or by using an address announced somewhere. This may mean that the entire language negotiation mechanism is bypassed. So the user might run into a page that is all Greek to him but that also exists in a language he knows. Thus, if the page has links to the other versions, it will help.

It is best to start by linking the versions to each other explicitly. After that, consider whether there is a need and a possibility to use language negotiation, too.


It is difficult to decide whether language-specific or generic links should be used within the site itself and in references to its pages from outside. Normally, generic links are preferable. However, such an approach makes things more difficult if the user wishes to read pages in a language that is not topmost in his preferences. For example, if I'd like to know what information exists in Italian at the site http://www.debian.org, I can select the link to the Italian version on the main page. However, when I follow links there, I will get versions as determined by the language preferences in my browser, since the links are generic. I can switch to the Italian version of each page as I wish, using the explicit link, but I need to repeat this on every page. This however should probably be regarded as an exceptional case, which should be handled by the user, e.g., by temporarily changing the language preferences in the browser. To summarize, links should normally be generic, i.e., point to URLs that are resolved with the language negotiation mechanism.

When you apply the principles suggested here, each page has a language selection menu. You don't need a separate language selection page (i.e., a page that has no real content but language links or buttons). Such pages tend to frustrate users and cause unnecessary delays.

Figure 10-6. A set of language links, using codes

10.3.3. Writing Link Texts

When referring to different language versions, it is essential how we choose the link text, i.e., the "thing" that acts as a clickable or otherwise selectable part of a page, through which the link can be followed. In principle, that "thing" can be an image, too, but usually textual links work best. Especially in this context, it is not at all a good idea to use an image, since the most natural way to refer to a version of a document in another language is to use words, or maybe something else expressed as text. It is a particularly bad idea to use flags of countries as symbols for languages.

There are several alternatives that may work well for language links:

  • The name of the document in the language

  • The name of that language, in the language itself (or maybe in English)

  • A code for the language, such as a two-letter code (see Chapter 7)

  • A combination of the above

One possible exception to using text links is a situation where the link text would be in a language that cannot be presented reliably as text, due to character code problems. Thus, for example, when language names are used as link texts, it might be necessary to use an image to denote Arabic (but naturally one needs to specify a textual replacement for such an image too, using the alt attribute, e.g., alt="Arabic").

The choice depends on the number and nature of the languages involved, as well as on the context. In some situations, when there are many languages, two-letter or three-letter codes might be a suitable approach, even though people will have to learn to recognize the codes of the languages that are relevant to them. But it isn't that difficult to learn that en or eng stands for English. Figure 10-6 shows one set of links, using two-letter codes, pointing to versions of a page on the European Union (EU) site http://www.eu.int. As you can see, even this compact style requires considerable space. It is not intuitively clear, since the languages do not appear to be ordered by any apparent principle. (The secret order is by the native name of the language: castellano, čeština, dansk,....) However, if the same order is used consistently, people learn to live with it. The approach of using codes has the benefit of requiring basic Latin letters only.

Unavoidably, when we use the names of the linked page in the different languages as link texts, we have to create a page with a mixture of languages, if only in the links. This affects the choice of the character encoding, as described in Table 10-3 (earlier in the chapter). Especially when several scripts (e.g., Latin and Greek) are mixed, UTF-8 may be the best option. However, since the link texts are typically relatively short, the use of ASCII and character references might be feasible, too.

Figure 10-7. Using names of languages as link texts

Figure 10-8. Using names of the linked documents as link texts

Rather often, multilingual sites use drop-down menus for a language choice. This may sound suitable when there are many languages and even the two-letter codes would take too much space, in someone's opinion. However, drop-down menus on web pages suffer from usability problems, and their primary benefit (saving space by hiding information, until the menu is opened) is also their basic problem.

A rather verbose approach is illustrated in Figure 10-7, excerpted from a page of the Debian site http://www.debian.org. It uses the name of each language, in the language itself, as link text, with a Latin transcription in parentheses for languages that use a non-Latin script. The names are in alphabetic order by the version in Latin letters. (Chinese appears last, with a variant specifier in parentheses.) On the positive side, if you know any of the languages listed there, you can find the right link. The presentation is somewhat messy, because there are no separator characters between the links.

Yet another approach, which might be the best one for the main page of a multilingual site, is to use a list of links with the name of the page in each language as the link text. This is illustrated in Figure 10-8, which shows a part of the links on the main page of the EU. Each link is preceded by the two-letter code of the language, to help with identification. (The language codes could also be used as the basis for ordering the links.)

Technically, the language codes on the EU page are actually images, but they could be as well, or better, implemented as styled text. It is probably best to make the code part of the link, since a user might click on the code and not on the text. This means you could use HTML markup like the following, plus some CSS to style the appearance:

 <a href="index_cs.htm" hreflang="cs"><span>cs</span> <span lang="cs">Portál Evropské unie</span></a>
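When there are many languages, links of this kind are tedious to write by hand, so they might be generated. The following is a minimal sketch in Python, not the method used on the EU site: the function name language_link, the class name langcode (a hook for the CSS), and the English entry with its filename and title are assumptions made for illustration; only the Czech entry comes from the example above.

```python
from html import escape


def language_link(href, code, title):
    """Return one link whose text is the language code plus the page title."""
    return (
        f'<a href="{escape(href)}" hreflang="{escape(code)}">'
        f'<span class="langcode">{escape(code)}</span> '
        f'<span lang="{escape(code)}">{escape(title)}</span></a>'
    )


# (filename, language code, page title in that language)
pages = [
    ("index_cs.htm", "cs", "Portál Evropské unie"),
    ("index_en.htm", "en", "Gateway to the European Union"),  # illustrative entry
]

for href, code, title in pages:
    print(language_link(href, code, title))
```

Both the code and the title are inside the a element, so a click on either follows the link.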

An advantage is that when someone who knows just one of the languages visits the page, he can both identify the link that is the right one for him and get an idea of what the site is about. As a disadvantage, such links are verbose, and the mixture of languages can be confusing, even alienating. This is one reason why language negotiation may help: when successful, it takes the user directly to the version he understands best.

The placement of the language links may vary. Putting them at the start (e.g., in the upper-right corner) makes them easy to notice and use, but they may be a distraction when the page is read linearly, and they may not fit the visual design either. When placed at the end, they don't disturb much, but the user might notice them all too late, or not at all.

10.3.4. Language Negotiation in the HTTP Protocol

The language negotiation mechanism is based on the following idea:

  • When a browser sends a request to a server, it may specify the user's language preferences in a certain format.

  • If the resource that the browser asked for is available in different language versions, the server can be configured to select one of the versions according to the preferences mentioned earlier in this chapter.

At the level of the HTTP protocol, the browser sends an Accept-Language header, which lists the acceptable languages and their relative acceptability. More exactly, it lists the languages so that a language indicator (code) can be followed by a quality value, which is a number between 0 and 1, specifying the relative acceptability. For example, the header:

Accept-Language: fr;q=1, en;q=0.2

would say that both French (fr) and English (en) are acceptable, but French is much more acceptable. (This does not necessarily imply that the server always sends a French version, if it is available; a server could also consider the relative "goodness" of the versions.) The notation is a bit strange, since in it, the comma is a stronger separator than the semicolon; additional confusion can be caused by the rather common way of leaving a space after the semicolons.
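To make the roles of the comma and the semicolon concrete, here is a minimal sketch in Python of how a header value of this kind could be taken apart. The function name parse_accept_language is hypothetical, and treating an unreadable q value as 0 is an assumption of this sketch, not something the protocol prescribes.

```python
def parse_accept_language(header: str) -> list[tuple[str, float]]:
    """Parse an Accept-Language value into (language, q) pairs, best first."""
    prefs = []
    for part in header.split(","):          # the comma separates languages
        fields = part.strip().split(";")    # the semicolon attaches parameters
        lang = fields[0].strip().lower()
        if not lang:
            continue
        q = 1.0                             # q defaults to 1 when omitted
        for param in fields[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0                 # assumption: treat a bad q as unusable
        prefs.append((lang, q))
    # sorted() is stable, so equal q values keep the order of the header.
    return sorted(prefs, key=lambda item: item[1], reverse=True)


print(parse_accept_language("fr;q=1, en; q=0.2"))
# [('fr', 1.0), ('en', 0.2)]
```

Splitting on the comma first and on the semicolon second reflects the relative strength of the two separators, and stripping whitespace takes care of the optional spaces.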

10.3.5. Language Negotiation: the Server Side

It depends on the server and its settings whether and how an author can make versions of pages in different languages available via the language negotiation mechanism. Here we discuss only the methods that might be used in one widely used server software, Apache, and mainly just one of the two alternative methods there. For details, consult applicable server software documentation such as http://httpd.apache.org/docs/.

Apache has two basic methods for content negotiation:


Multiviews

The alternative versions are in the same directory, and they are named in some uniform way. The author specifies some general rule according to which a generic URL is to be mapped to filenames referring to different versions.


type-map

For each generic URL, there is a separate file that lists the corresponding language-specific filenames, possibly with some associated properties (e.g., the encoding of the file).

10.3.5.1. Using Multiviews

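If Multiviews is enabled for the directory (in Apache this is done with the Options directive, where the option is spelled MultiViews; it is not included in Options All, so it may have to be turned on explicitly), you can use language negotiation in the following, though somewhat limited, manner for a directory: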

  1. Add something like the following into the .htaccess file in a directory. Use the two-letter language code as the first argument in these directives, and use whichever suffix you like as the second argument:

    AddLanguage en .en
    AddLanguage fi .fi
    AddLanguage fr .fr

  2. Name the versions of a document so that the normal filename has the additional suffix as just defined, e.g., using foo.txt.en for the English version of foo.txt and foo.txt.fr for the French version. (You don't need to create a file named foo.txt.) Note that language negotiation works well for plain text files, too; the negotiation does not depend on the data format of the file.

  3. Now you can use a URL like http://www.cs.tut.fi/~jkorpela/multi/foo.txt as a generic URL that works via language negotiation. The specific language versions, like http://www.cs.tut.fi/~jkorpela/multi/foo.txt.fr, can be used too whenever desired.
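The naming rule in these steps can be modeled in a few lines. The following Python sketch is only a simplified model of the idea, not Apache's actual matching algorithm, which is more flexible; the function name find_variants and the directory listing are made up for illustration.

```python
def find_variants(generic_name, filenames, suffix_to_language):
    """Map a generic name such as foo.txt to its language versions.

    Returns a dict like {"en": "foo.txt.en"} for files that exist.
    """
    variants = {}
    for suffix, language in suffix_to_language.items():
        candidate = generic_name + suffix
        if candidate in filenames:
            variants[language] = candidate
    return variants


# Mirrors the AddLanguage lines: filename suffix and language code.
suffixes = {".en": "en", ".fi": "fi", ".fr": "fr"}
directory = ["foo.txt.en", "foo.txt.fr", "bar.html.fi"]

print(find_variants("foo.txt", directory, suffixes))
# {'en': 'foo.txt.en', 'fr': 'foo.txt.fr'}
```

The server would then choose among the variants found according to the language preferences sent by the browser.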

10.3.5.2. Using type-map

The alternative method for content negotiation can perhaps best be described with a simple example. I have a document in Finnish http://www.cs.tut.fi/~jkorpela/rfct.html and a version of it in English http://www.cs.tut.fi/~jkorpela/rfcs.html. Into the directory where those files reside, I have written a file named .htaccess containing the line:

AddHandler type-map var

This makes the server handle URLs ending with .var in a special way. (This might be a system-default.) I have created, in that directory, a file named rfc.var and with the following content:

URI: rfcs.html
Content-Type: text/html; charset=iso-8859-1
Content-Language: en

URI: rfct.html
Content-Type: text/html; charset=iso-8859-1
Content-Language: fi

This causes the URL http://www.cs.tut.fi/~jkorpela/rfc.var to become operational, so that the server will respond by sending a Finnish version or an English version, according to the language preference settings in the user's browser.
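The structure of such a file is simple: groups of header lines, one group per version, separated by blank lines. As a rough illustration, the following Python sketch reads a type-map of that shape into a list of dictionaries. It is a simplification that assumes one "Name: value" header per line and does not handle comments or continuation lines; the function name parse_type_map is hypothetical.

```python
def parse_type_map(text):
    """Parse type-map text into a list of dicts, one per variant.

    Simplified: one "Name: value" header per line, blank lines
    between variants; comments and continuation lines are not handled.
    """
    variants, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:                 # a blank line ends a variant
                variants.append(current)
                current = {}
            continue
        name, sep, value = line.partition(":")
        if sep:
            current[name.strip().lower()] = value.strip()
    if current:
        variants.append(current)
    return variants


rfc_var = """\
URI: rfcs.html
Content-Type: text/html; charset=iso-8859-1
Content-Language: en

URI: rfct.html
Content-Type: text/html; charset=iso-8859-1
Content-Language: fi
"""

by_language = {v["content-language"]: v["uri"] for v in parse_type_map(rfc_var)}
print(by_language)
# {'en': 'rfcs.html', 'fi': 'rfct.html'}
```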

10.3.5.3. When negotiation fails

If a browser sends language preferences such that none of the versions is acceptable according to them, Apache sends back the HTTP error code "406 Not Acceptable." By default, the text "Not Acceptable" will be shown to the user, along with a list of links to the alternative versions. The links are not very descriptive. This isn't user-friendly error handling.

There are different ways to improve the error handling (e.g., by creating a specific error page for the error code 406). The best option is, however, probably to append a generic alternative to the list: an alternative with no Content-Language specified. Such an alternative will be sent by the server as a response to a request that cannot be satisfied by any other alternative.

The generic alternative should be a page that explains the available alternatives in English, with their names in their own languages. The page could additionally, for the general benefit of the user, give the user some advice on setting his browser's language preferences at least by adding English there, if he understands English.
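The effect of a generic alternative can be sketched as follows. This Python fragment is a simplified model, not Apache's algorithm: it matches language codes exactly, ignoring variants such as en-US and the wildcard, and the function name choose_variant and the filenames are invented for illustration.

```python
def choose_variant(prefs, available, generic=None):
    """Pick a language version, or fall back when nothing is acceptable.

    prefs     -- (language, q) pairs, e.g. [("sv", 1.0), ("de", 0.5)]
    available -- dict mapping language codes to filenames
    generic   -- filename of a version with no language, or None
    Returns (status, filename); 406 mimics "Not Acceptable".
    """
    best_file, best_q = None, 0.0
    for lang, q in prefs:
        if lang in available and q > best_q:
            best_file, best_q = available[lang], q
    if best_file is not None:
        return 200, best_file
    if generic is not None:
        return 200, generic          # the language-neutral menu page
    return 406, None


versions = {"en": "index.en.html", "fi": "index.fi.html"}
print(choose_variant([("fi", 1.0), ("en", 0.8)], versions))
# (200, 'index.fi.html')
print(choose_variant([("pt", 1.0)], versions))
# (406, None)
print(choose_variant([("pt", 1.0)], versions, generic="index.html"))
# (200, 'index.html')
```

With the generic page in place, a user whose preferences match nothing gets a readable menu instead of an error.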

10.3.6. Language Negotiation: the Browser Side

In Chapter 7, we described the different meanings of "language settings" in software. We mentioned that one of the meanings is to set language preferences in browsers, and illustrated this a bit. It is probably a good idea to check your browser's language preferences now. On Internet Explorer, use Tools > Internet Options > Languages. Note that on IE, you can select either a language generically, e.g., English (en), or a country-specific variant, such as U.S. English (en-US). If you choose a specific variant, it is a good idea to select the language generically, too, as the next option.

The page "Debian web site in different languages," http://www.debian.org/intro/cn, contains generally useful instructions (in different languages) on setting language preferences in several browsers.

Most browsers send language preferences to the server according to an ordered list of languages in the browser settings. The browser computes, by some algorithm, quality values to be associated with the language codes, starting from 1 for the first one. For example, if you set the list of languages to Spanish (es), English (en), and Portuguese (pt), your browser might send the following (defaulting the q value to 1 for the first language):

es,en;q=0.9,pt;q=0.8
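The exact algorithm varies by browser. As one plausible scheme, assumed here purely for illustration, each later language could get a q value 0.1 lower than the previous one. The Python sketch below (with the hypothetical function name build_accept_language) reproduces the value shown above under that assumption.

```python
def build_accept_language(languages, step=0.1):
    """Build an Accept-Language value from an ordered list of codes.

    The first language gets the implied q=1; each later one gets a
    q value lowered by `step`, never below 0.1 (assumed scheme).
    """
    parts = []
    for position, code in enumerate(languages):
        q = max(1.0 - step * position, 0.1)
        if position == 0:
            parts.append(code)                  # q=1 is left implicit
        else:
            parts.append(f"{code};q={q:.1f}")
    return ",".join(parts)


print(build_accept_language(["es", "en", "pt"]))
# es,en;q=0.9,pt;q=0.8
```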

Typically the default setting in a browser is that the list consists of one language only, the "own" language of the browser, i.e., the language used in its user interface (menus, buttons, error messages, etc.). This naturally implies that if you install, say, an English version of a browser and do not change the language preferences, the settings say that you only know English. This usually isn't fatal, but it usually isn't optimal.

Problems may arise if the same computer and browser are used at different times by different people with different language preferences, e.g., in a classroom. There does not seem to be any simple solution to that at present. The systems could be configured to reset the settings to something generally reasonable at startup.

10.3.7. Notes on Multilingual Sites

Language negotiation deals with the technical problem of picking up and sending the best possible alternative among versions of a page in different languages. It does not perform any translation, and it does not settle how the versions are produced, named, and presented. Here we will briefly consider some such aspects. Many of them are discussed in more detail at http://webtips.dan.info/language.html.

10.3.7.1. Producing the translations

When producing different language versions, automatic translation programs might be used to some extent. However, a competent human translator should be responsible for the translation work. Optimally the human translator should know the basics of the HTML language so that he can produce the translation directly as an HTML document. That way, the material to be translated could be delivered in an HTML document, and the translator would replace the texts, leaving (usually) the HTML markup as it is.

As another alternative, the text could be given to the translator either as a plain text file or as displayed by a web browser, for example, as printed on paper. In the latter case, the translator could deduce some relevant information from the appearance of the text. On the other hand, HTML markup could better tell the intended structure of the document, which may have some significance in selecting between alternatives in the translation. In any case, if the translator sends only the translated text, then someone else has to put it into HTML format, in practice, by merging the text with HTML markup. This cannot be done without knowing the language of the translation to some extent.

When working with the HTML format, it is essential to specify the encoding of the documents. The encoding may be different for different languages. This is one reason why MS Word format is often used, since the encoding is normally not a problem there. Conversion from that format to HTML may require quite some work, though.

10.3.7.2. Translation or different content?

The versions of a page in different languages can be "pure translations" of each other; in practice, that would usually mean that one of the versions is the original one and other versions have been translated from it. A "pure translation" consists of the original document, with the content and form strictly preserved, just expressed in another language. This means, for example, that the translation also contains the same factual errors as the original, the same references to local states of affairs, etc.

Quite often, a pure translation is not appropriate for the purposes of the page. On the other hand, it is not adequate to use a language negotiation mechanism to distribute documents with completely different content, just with the same topic. It is sometimes difficult to draw the line.

The specification of the language negotiation mechanism does not require that the versions be exactly equivalent. On the contrary, the mechanism contains the possibility of specifying quality values, which may result in a selection of a version in a language that is lower in the user's preferences than another available language, due to quality difference. For example, if the user knows German a little better than French, he could have specified this in his language preferences; if the server has a version of the requested document in German but also a considerably more up-to-date or more extensive version in French, it might respond by sending the latter. In practice, such situations are probably still rare, partly because popular browsers do not let the users control the quality values associated with languages, only the repertoire and ordering of languages in the user's preferences.
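One simple way to model this interplay is to multiply the user's quality value for a language by a quality value that the author assigns to each version. The Python sketch below uses that rule as an assumption for illustration; actual servers may combine the values differently, and the function name and the numbers are invented.

```python
def best_by_combined_quality(user_q, server_qs):
    """Return the language with the highest product of user and server quality.

    user_q    -- dict of language -> q from Accept-Language
    server_qs -- dict of language -> quality assigned to that version
    Returns None if no available language is acceptable.
    """
    best_lang, best_score = None, 0.0
    for lang, qs in server_qs.items():
        score = user_q.get(lang, 0.0) * qs
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# The user slightly prefers German, but the French version is rated higher.
user = {"de": 1.0, "fr": 0.9}
server = {"de": 0.5, "fr": 1.0}
print(best_by_combined_quality(user, server))
# fr
```

Here the German version scores 0.5 and the French one 0.9, so the French version is sent even though German ranks first in the user's preferences.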

10.3.7.3. Indicating what is available in each language

When you have a multilingual site, it is crucial to tell people what is really available in different languages. For example, if your site is dominantly in German but has a few pages in English as well, you should make it very clear in the English version that it presents only a small part of the information available in German. Otherwise, a visitor who knows both languages but prefers English might never make real use of the site.

It is mostly sufficient to include such information in the main page in each language. But, for example, if the site contains a news page so that some but not all of the articles are available in German too, then it would be misleading to make the German version contain those articles only. Instead, the news page should minimally say that more news articles are available in English (naturally, the site should include a link to that English page). It could also contain links to English news articles that have not been translated, merged with the news in German. Preferably, the headlines of such news should appear as translated, along with a clear indication of the link pointing to text in English.

10.3.7.4. Naming the versions

When selecting URLs for versions of documents in different languages, a systematic approach is often desirable, for practical reasons like creating and maintaining the pages. This can be implemented in different ways; the method could, for example, be either of the following:

  • The path part of an address contains a separate part that specifies the language, e.g., http://www.something.example/en/foo.html (for an English version) and http://www.something.example/fi/foo.html (for a Finnish version). In practice, this usually corresponds to having pages in one language in a directory of their own.

  • At the end of an address, the part immediately preceding the .html (or equivalent) part contains a hyphen (or other punctuation character) and a language code, e.g., http://www.something.example/foo-en.html and http://www.something.example/foo-fi.html. In practice, this usually corresponds to having pages in different languages in the same directory but with different names, according to a systematic naming scheme.

Both methods have the problem that the "proper name" of the document (in our example, "foo") should be reasonably understandable internationally. This typically means that you use English words there, partly because things are much easier if URLs contain only ASCII characters.
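A systematic scheme pays off when addresses are generated by a program, e.g., when building the language links for each page. The following Python sketch shows both schemes side by side; the function names directory_style and suffix_style are hypothetical.

```python
def directory_style(base, lang, name):
    """Language as a path segment: http://host/en/foo.html"""
    return f"{base}/{lang}/{name}.html"


def suffix_style(base, lang, name):
    """Language code before the extension: http://host/foo-en.html"""
    return f"{base}/{name}-{lang}.html"


base = "http://www.something.example"
for lang in ("en", "fi"):
    print(directory_style(base, lang, "foo"), suffix_style(base, lang, "foo"))
```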

10.3.7.5. Language preferences and JavaScript

In the JavaScript language, it is under some conditions possible to determine the browser language. This however is almost always useless, and it has nothing to do with the user's language preferences. The browser language is just the language of the browser's user interface.

It is very common to use English versions of browsers just because there are no alternatives or because versions in other languages have confusing translations for terms. The basic use of a browser does not require much understanding of the browser language, since most of the basic functions can be activated using icon buttons or other simple tools so that it suffices to know a very small repertoire of words.

10.3.7.6. Making use of language preferences in CGI scripts

In CGI scripts, it is possible to use language preferences as sent by browsers. The value of the Accept-Language header as defined in the protocol manifests itself to a CGI script as the environment variable HTTP_ACCEPT_LANGUAGE (which needs to be written this way, using uppercase letters).

According to the protocol, the value of this variable contains a comma-separated set of parts, each of which consists of a language code that is optionally followed by the specification of a q value. It is relatively easy to parse this, e.g., in a CGI script written in Perl, using the split function for division into parts. The following code sample performs this and sets the variable $preferred to the language code that corresponds to the language that is primary according to the preferences. Here we set English as the default language, to be implied if the browser sends no preferences:

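# Accept-Language as passed by the server; empty if the browser sent none.
my $accept = $ENV{'HTTP_ACCEPT_LANGUAGE'} // '';
my $preferred = 'en';                  # default language
my $prefq = 0;
foreach my $pref (split /,/, $accept) {
    $pref =~ s/\s+//g;                 # ignore optional spaces
    next if $pref eq '';
    my ($lang, $qval) = ($pref, 1);    # q defaults to 1 when omitted
    if ($pref =~ /^(.*?);q=(.*)$/i) {
        ($lang, $qval) = ($1, $2);
    }
    if ($qval > $prefq) {              # keep the highest q value seen so far
        $preferred = $lang;
        $prefq = $qval;
    }
}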

The result can be used, for example, to index a hash containing language-dependent strings. For example, if we would like a CGI script in Perl to write texts either in Finnish or in English when dynamically generating an HTML document, we could write the alternate texts into a hash and pick up the right text from it, as the following example shows:

$gen{'en'} = 'Report generated at ';
$gen{'fi'} = 'Raportin luontihetki: ';
...
print "<div>$gen{$preferred} $now.</div>";
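Note that a direct lookup of this kind yields nothing if the preferred language is one for which no text exists, or if the browser sent a country-specific code such as en-US. A lookup with fallbacks avoids that. The following is a minimal sketch of the idea, written in Python rather than Perl; the function name pick_message and the fallback order are assumptions for illustration.

```python
MESSAGES = {
    "en": "Report generated at ",
    "fi": "Raportin luontihetki: ",
}


def pick_message(preferred, messages, default="en"):
    """Look up the text for a language code, falling back step by step."""
    code = preferred.strip().lower()
    if code in messages:
        return messages[code]
    primary = code.split("-")[0]      # "en-US" -> "en"
    if primary in messages:
        return messages[primary]
    return messages[default]          # nothing matched: use the default


print(pick_message("fi", MESSAGES) + "12:00")     # Raportin luontihetki: 12:00
print(pick_message("en-US", MESSAGES) + "12:00")  # Report generated at 12:00
print(pick_message("fr", MESSAGES) + "12:00")     # Report generated at 12:00
```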

10.3.8. Types of Negotiation

Although we have concentrated on language negotiation, similar mechanisms work for other types of content negotiation, though normally without using user preferences:


Media type negotiation

You can make the same information available, for example, as plain text, in PDF format, and in HTML format. You could then use the type-map mechanism of Apache as for language negotiation, but with different Content-Type headers for the versions. The browser is expected to list its media type preferences in an Accept header. This is not very useful in most cases, since browsers often express such preferences in a manner that contains too little information or cannot be trusted in practice.


Encoding negotiation

Similarly, you can make the same information available in different character encodings. Using the type-map mechanism for example, the Content-Type headers in your definition file would contain charset parameters that indicate the encoding of each version. The browser is expected to list its encoding preferences in an Accept-Charset header. However, many popular browsers do not send such a header at all, which means that they accept any encoding.


Transfer encoding negotiation

Additional transfer encoding (see Chapter 6) can be agreed upon between the browser and the server. A browser uses Accept-Encoding to specify the transfer encodings it can handle. Figure 10-9 shows how the Opera browser announces that it can handle deflate, gzip, and x-gzip but nothing else. It accepts "identity," which means no transfer encoding, but assigns a quality value of zero to everything else.
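The logic of such a header can be sketched in Python. The header value below is an assumption reconstructed from the description above, not copied from the figure, and the function name accepts_coding is hypothetical. The sketch follows the usual rule that "identity" stays acceptable unless it is excluded explicitly or by a wildcard with a q value of zero.

```python
def accepts_coding(header, coding):
    """Return True if `coding` is acceptable under an Accept-Encoding value."""
    explicit, wildcard = {}, None
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        name = fields[0].lower()
        if not name:
            continue
        q = 1.0                               # q defaults to 1 when omitted
        for param in fields[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                q = float(value)
        if name == "*":
            wildcard = q
        else:
            explicit[name] = q
    coding = coding.lower()
    if coding in explicit:
        return explicit[coding] > 0
    if wildcard is not None:
        return wildcard > 0
    # With no wildcard, only "identity" is acceptable implicitly.
    return coding == "identity"


# Assumed header matching the description: four codings, nothing else.
header = "deflate, gzip, x-gzip, identity, *;q=0"
print(accepts_coding(header, "gzip"))      # True
print(accepts_coding(header, "identity"))  # True
print(accepts_coding(header, "compress"))  # False
```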



Unicode Explained
ISBN: 059610121X
Year: 2006
Pages: 139