This chapter describes how character encoding information is transmitted in Internet protocols, including MIME and HTTP, and how content negotiation works on the Web, mainly for negotiating character encoding and language. This forms the basis for a presentation of some fundamentals of multilingual web authoring at the technical level. The chapter also describes the use of characters in the protocols themselves, such as in Internet message headers and URLs, with a focus on the partial shift from pure ASCII to Unicode. In particular, it describes the technical basis of Internationalized Domain Names and Internationalized URLs.
People often first encounter problems with character encoding when they start authoring web pages in new languages. If you have a web site in English, you might never think about encodings, since you can work with default settings. Then, if you want to add a page in Japanese or Arabic, you face several questions at once:
What authoring tools (software) should I use?
What fonts do I use?
Which encoding should I use?
How do I give information about the encoding?
What tags should I put in my documents to indicate the language I'm using?
Many of the difficulties in such situations arise from the common confusion of fonts, encodings, and languages. Other chapters of this book have explained such issues; in this chapter, we mostly concentrate on the encodings. A suitable approach is:
Determine the character repertoire that you will need (see Chapter 7). Consider both the needs of the language(s) you use and the special symbols that might appear.
Select an encoding that covers that repertoire and is suitable for use on the Web. Chapters 3 and 6 have described the encodings, but in this chapter, we consider the special conditions of web publishing. Note that it is even possible to use an encoding that does not support all the characters needed, since you can use special notations like character references to overcome the limitations of an encoding.
Select software that lets you work conveniently with the encoding and with the characters you need. In practice, you may need to consider what software is available before you decide on the encoding. Such topics were discussed in Chapter 2.
Make sure that the web server sends information about the encoding in at least one way, and possibly in several. This is explained in this chapter.
Use language markup if you know how to use it properly, but do not rely on it. It mostly has no effect, except possibly on typography (font selection) in some browsers. See Chapter 7.
Worry about fonts if you wish or need to, but do not think that font settings solve any of the fundamental problems listed here. Rather, setting fonts is like painting a house once it has been built. Font issues are mostly outside the scope of this book. You would normally use Cascading Style Sheets (CSS) to affect fonts in web authoring, but you might also create a PDF version of a document, with fonts embedded in it.
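As a rough illustration of the encoding-related steps in this list, the following Python sketch checks whether a candidate encoding covers a piece of text, falls back to numeric character references for characters the encoding lacks, and builds the kind of header value a server might send to declare the encoding. The function names and the sample string are hypothetical, chosen only for this example; the sketch uses Python's standard codec machinery and is not a description of how any particular authoring tool or server works.

```python
def covers(text: str, encoding: str) -> bool:
    """Return True if every character in text can be encoded in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def encode_with_references(text: str, encoding: str) -> bytes:
    """Encode text, writing characters the encoding lacks as numeric
    character references (for example, &#8364; for the euro sign)."""
    return text.encode(encoding, errors="xmlcharrefreplace")


def content_type(encoding: str) -> str:
    """Build a Content-Type header value for an HTML document (illustrative)."""
    return f"text/html; charset={encoding}"


# Hypothetical sample: Latin letters, the euro sign, and Japanese text.
sample = "Price: 10 € (日本語)"

print(covers(sample, "iso-8859-1"))                  # ISO-8859-1 lacks € and kanji
print(covers(sample, "utf-8"))                       # UTF-8 covers all of Unicode
print(encode_with_references(sample, "iso-8859-1"))  # references replace the gaps
print(content_type("utf-8"))
```

Character references keep a document valid in a limited encoding, but they make the source harder to read and edit, so an encoding that covers the whole repertoire directly is usually the more convenient choice when the software supports it.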