1.8. Working with Encodings

When you use characters on a computer, some software will internally encode them in binary format. Most users never need to know the details of this, still less need to actually handle the encoding process, but it is essential to know that there are different encodings, with different properties. In transferring data between applications and computers, you may need to change the encoding or select a suitable encoding.

1.8.1. Selecting the Encoding When Saving

Text editors and many other programs typically have a File menu, with a Save function for storing data onto disk. Normally, this function uses the file format and the character encoding that is typical of the program. However, there is usually also a Save As function, which lets the user select the format and encoding. This function is often used because it lets you save an edited document under a different filename.

The Save As function is often the simplest way to convert between different encodings (and file formats). You simply open a file and save it differently. For example, suppose you have used Notepad to create a plain text file. If you use, for example, an English version of Windows, the default encoding that Notepad uses is Windows Latin 1. Now suppose that a friend has asked you to send your text in the UTF-8 encoding for some reason. You simply open your file in Notepad, select File > Save As, and then choose the UTF-8 encoding from the menu of encodings, as shown in Figure 1-13. It illustrates the three basic things you can (and need to) specify in Save As dialogs: the filename, the file format, and the encoding.
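The same conversion can also be done without a Save As dialog. The following is a minimal sketch in Python, not a description of what Notepad does internally; the function name convert_encoding and the filenames are hypothetical, chosen only for illustration.

```python
import tempfile
from pathlib import Path


def convert_encoding(src: Path, dst: Path, src_encoding: str, dst_encoding: str) -> None:
    """Read src as src_encoding and write the same text to dst as dst_encoding."""
    text = src.read_text(encoding=src_encoding)
    dst.write_text(text, encoding=dst_encoding)


# Demonstration with temporary files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "notes-latin1.txt"
    converted = Path(tmp) / "notes-utf8.txt"

    # Simulate a file saved in Windows Latin 1 (windows-1252).
    original.write_bytes("café".encode("windows-1252"))

    convert_encoding(original, converted, "windows-1252", "utf-8")

    original_bytes = original.read_bytes()    # é stored as one octet
    converted_bytes = converted.read_bytes()  # é stored as two octets
```

The text is the same in both files; only the octets that represent the accented letter differ.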

The list of possible encodings in a Save As dialog varies greatly, and the names of the encodings are not always official names. For example, in Microsoft products, "ANSI" often appears as denoting the character code that the system uses as its normal 8-bit code, such as the Windows Latin 1 encoding, which should be called "windows-1252." The word "Unicode" may denote different encodings used for Unicode, typically UTF-16. Use the UTF-8 encoding for Unicode text, unless you have a good reason for doing otherwise.
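To see why the name shown in a menu matters, it may help to look at the octets that one and the same character produces in the encodings mentioned above. This is a small sketch using Python's standard codecs; the choice of the letter é and of the little-endian form of UTF-16 (without a byte order mark) is an assumption made for the example.

```python
import codecs

text = "é"  # U+00E9, a single character

# Python registers windows-1252 under the codec name "cp1252".
ansi_name = codecs.lookup("windows-1252").name

ansi_bytes = text.encode("windows-1252")  # one octet
utf8_bytes = text.encode("utf-8")         # two octets
utf16_bytes = text.encode("utf-16-le")    # two octets, little-endian, no byte order mark
```

A file labeled "ANSI," "Unicode," or "UTF-8" therefore contains different octets for the same text, and a program that reads it must know which label applies.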

When using a text-processing program, the situation is usually different. There is a file format menu in the Save As dialog but often no encoding menu. The reason is that in text processing, the overall format is crucial, and the encoding is often coupled with the format.

Figure 1-14. An extract from a Save As dialog in Microsoft Word


In Microsoft Word, for example, the list of formats may contain alternatives as shown in Figure 1-14, with options corresponding to the internal formats of different programs and some plain text formats. Here, too, it may require some guesswork or study to identify what the options really mean. On Windows systems, "*.txt" is associated with several different encodings, and "*.ans" refers to ANSI (e.g., windows-1252). The notation "*.asc" may suggest ASCII encoding, but in fact it refers to an old DOS encoding, a code page, which is a single-octet encoding and may vary from one system to another.

Having selected a plain text (*.txt) format, modern versions of Microsoft Word ask you to specify the encoding in another dialog. In older versions, this happens if you select the "Encoded text" format. In this mode, the default is "Windows" or, more explicitly, something like "Western European (Windows)," which means windows-1252. The dialog is shown in Figure 1-15. The user has typed in a sample sentence that contains an em dash (U+2014) and two special characters that Word cannot save in windows-1252. When saving as windows-1252, Microsoft Word is about to quietly change the em dash to a hyphen "-" (for some odd reason) and to omit the two special characters, but it issues a warning about them. If you would like to have them saved, you would need to select an encoding that makes this possible, such as UTF-8.
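A program can check in advance which characters an encoding cannot represent, which is roughly what the warning in the dialog reports. The sketch below is an illustration only: the helper unencodable_characters is hypothetical, and the sample characters are assumptions, not the ones used in the figure.

```python
def unencodable_characters(text: str, encoding: str) -> list[str]:
    """Return the distinct characters in text that encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


# Sample text (an assumption for this example): o with double acute,
# an em dash, and a Greek capital omega.
sample = "Erd\u0151s \u2014 \u03a9"

lost_in_1252 = unencodable_characters(sample, "windows-1252")
lost_in_utf8 = unencodable_characters(sample, "utf-8")
```

In Python's windows-1252 codec the em dash is representable, so only the other two characters are reported; UTF-8 can represent all of them.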

In Save As dialogs, there are often additional settings that affect line break conventions, which are discussed in Chapter 8. These conventions specify which control characters are used to separate lines of text, as well as the method of presenting paragraphs internally. Microsoft Word stores a paragraph as one long line and splits it to separate lines as needed for display. It is often desirable to split a paragraph into lines of reasonable length (e.g., at most 80 characters) when saving as plain text.
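Splitting a long paragraph into lines of limited length is easy to do programmatically as well. The following sketch uses Python's standard textwrap module; the 80-character limit comes from the text above, while the sample paragraph and the choice of CR LF as the line break are assumptions for the example.

```python
import textwrap

# One paragraph stored as a single long line, as a word processor might store it.
paragraph = ("Microsoft Word stores a paragraph as one long line and splits it "
             "to separate lines as needed for display. ") * 3

# Break the paragraph at word boundaries so that no line exceeds 80 characters.
wrapped = textwrap.fill(paragraph, width=80)
lines = wrapped.split("\n")

# Join the lines with an explicit line break convention (here CR LF).
crlf_text = "\r\n".join(lines)
```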

1.8.2. How Encodings Should Be Detected

Character encodings are of crucial importance, but most people, including most computer professionals, need not know the technicalities of encodings. To view an email message or a web page that is UTF-8 or ISO 8859-2 encoded, you need not know how characters are encoded in them. Instead, you need a program that understands the encodings, and perhaps you need to tell it to use a particular encoding.

Figure 1-15. Selecting encoding when saving as plain text in Microsoft Word

The correct interpretation and processing of character data of course requires knowledge (or correct guess) about the encoding used. For email, the encoding should be specified (by the email program) in so-called MIME headers, unless ASCII, the default encoding, is used. For HTML documents, such information should be sent by the web server along with the document itself, using HTTP headers, which resemble MIME headers. The headers are normally invisible to users but processed by a program, such as an email reader or web browser. Using special tools, the headers can be made visible for an analysis.
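As an illustration of how a program reads the encoding from such headers, here is a minimal sketch using Python's standard email package. The message text and the address are made up for the example.

```python
from email import message_from_string

# A tiny message with a MIME header that declares the encoding.
raw_message = (
    "From: sender@example.org\n"
    "Content-Type: text/plain; charset=ISO-8859-2\n"
    "\n"
    "body text\n"
)
msg = message_from_string(raw_message)
declared = msg.get_content_charset()  # the library returns the name in lowercase

# A message without any charset information: fall back to ASCII, the default.
unlabeled = message_from_string("From: sender@example.org\n\nbody text\n")
fallback = unlabeled.get_content_charset("us-ascii")
```

An HTTP Content-Type header carries a charset parameter in the same form, so the same kind of lookup applies to web pages.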

Thus, when everything works well, you need not see MIME or HTTP headers or care about them. But if things look odd, you may need to inspect them or at least force a program to make a particular guess about the character encoding. In some situations, you might have some prose description of the data format, such as an email sender's note like "the attached file is in ISO-8859-2." Beware that people don't always use the right terms in such notes.

Previously, ASCII was usually the implied default encoding, and it is still very commonly assumed. Nowadays ISO-8859-1, which can be regarded as an extension of ASCII, is often the practical default. The current trend is to avoid giving ISO-8859-1 such a special position among the variety of encodings. In XML, the default encoding is UTF-8.
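The XML default can be observed with any conforming parser. The sketch below uses Python's standard xml.etree.ElementTree module; the element name and the sample word are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# No XML declaration and no encoding label: the parser assumes UTF-8.
utf8_document = "<name>Rhône</name>".encode("utf-8")
parsed_default = ET.fromstring(utf8_document).text

# Another encoding has to be announced in the XML declaration.
latin1_document = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><name>Rhône</name>'
).encode("iso-8859-1")
parsed_declared = ET.fromstring(latin1_document).text
```

Both documents yield the same text, although their octets differ.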

To summarize, the character encoding of input data can be deduced from:

  • An explicit indication of the encoding, e.g., in protocol headers

  • An explicit or implicit agreement on using a particular encoding by default in a certain context

  • A private agreement or note about the encoding in a particular case

  • Guesswork based on the context or inspection of the data using different guesses

Figure 1-16. Fixing the display of an email message by setting the character encoding manually

In Chapter 3, we will discuss some commonly used encodings and their typical scope of use. This will help you in the guesswork. For example, if you get an email message from Poland and it contains some Polish names that look misspelled, the odds are good that the message is in fact in ISO-8859-2 or windows-1250 encoding, since these are very common in Poland.
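Such guesswork can be tried out by decoding the same octets under several candidate encodings and looking at which result reads as sensible text. This is a rough sketch, assuming a Polish name as sample data and a hand-picked list of candidates; it is not a general detection algorithm.

```python
# Octets as they might arrive when the encoding label is missing or wrong.
raw = "Wałęsa".encode("iso-8859-2")

candidates = ["iso-8859-1", "iso-8859-2", "windows-1250"]
readings = {}
for encoding in candidates:
    try:
        readings[encoding] = raw.decode(encoding)
    except UnicodeDecodeError:
        readings[encoding] = None  # the octets are not valid in this encoding

for encoding, reading in readings.items():
    print(f"{encoding:>12}: {reading}")
```

Under ISO-8859-1 the name looks misspelled, whereas the two Central European encodings give the correct spelling. They happen to agree on these particular letters, though not on all Polish letters, so a longer sample may be needed to tell them apart.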

1.8.3. Setting the Encoding Manually

Suppose you get email from abroad and it contains some strange characters in names or in other text. Figure 1-16 shows an example of a received email message, as displayed by the Mozilla Thunderbird email program. The message is meant to contain French words like "Rhône" and "moiré" but is displayed incorrectly, with Greek letters in place of accented Latin letters. The sender may have seen the text displayed correctly, but something went wrong in sending, and the fault is not in the recipient's email program. The message was sent with a header that incorrectly claims that it is encoded in ISO-8859-7, as we can see by selecting View > Character Encoding. Clicking on "Western (ISO-8859-1)" fixes the display.
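The effect is easy to reproduce. In the following sketch, the sample words come from the example above, and the octets are assumed to have been produced in ISO-8859-1; decoding them as ISO-8859-7 imitates what the email program does when it trusts the wrong header.

```python
intended = "Rhône, moiré"
octets = intended.encode("iso-8859-1")   # what the sender's program actually sent

displayed = octets.decode("iso-8859-7")  # what the wrong header makes the reader show
fixed = octets.decode("iso-8859-1")      # what the manual override shows

print(displayed)
print(fixed)
```

The octets themselves are untouched by the manual setting; only their interpretation changes, which is why the fix works here.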

Setting the encoding manually in the recipient's email program does not always help. For example, if a message has incorrect information about encoding, it may be converted to another encoding before it reaches the recipient. Since the information is wrong, the conversion goes all wrong too, and special repair might be needed.

1.8.4. Sending Unicode Email

Before sending Unicode email, make sure the recipient is willing to receive Unicode-encoded messages and knows what that means. Although most users have email programs that are capable of displaying such messages, the user may need to change settings (especially font settings) to see them properly. Moreover, on programs that cannot handle Unicode, the message would look more or less like garbage.

It's a good idea to test things by sending email to yourself. There are things that can go wrong in that simple case, and it's best that only you see your own initial mistakes. However, many problems will not be detected that way. If possible, find someone who works in a different environment (say, Mac or Unix, if you are using Windows) and uses a different email program, and exchange some test messages with Unicode characters in them.

There are basically three ways to send Unicode text by email:

  • As an attachment, e.g., in Microsoft Word format. This is usually no different from using a "normal," non-Unicode attachment. The recipient needs to know what to do with the attachment. Beware that attachments are often frowned upon for security reasons, and they might even be filtered out by firewalls.

  • In HTML format, typically as generated by an email program. Effectively, the program would convert "special" characters to HTML character references. This is what typically happens when you try to compose an email message with special characters in Outlook Express. Although this may solve some problems, it also causes some. HTML format messages cannot be read by all programs, and they may affect the classification of your message as unsolicited bulk email, or "spam."

  • As plain text, with message headers that specify the encoding (as explained in detail in Chapter 10). This is a simple and clean approach. It's very easy on Mozilla Thunderbird, for example. If you don't have that program on your system, you can download it from the http://www.mozilla.org site.

Figure 1-17 shows a dialog that appears when you have composed a Unicode email message in Thunderbird. The program asks for permission to send the message as UTF-8 encoded, which is just fine. In composing a message, you can use, for example, the Character Map program when using Windows, as explained in "Introduction to Characters and Unicode" earlier in this chapter.

Sent this way, the email message is plain text, effectively just a sequence of characters as Unicode code points, though with headers that specify the encoding. This means, in particular, that there is no font information included. It is up to the receiving email program to use the font(s) it has been set to use. The recipient needs to have some font that contains the character you have included, but she does not need to have the same font as you. This is essential for communication between people who work on different platforms, often with quite different choices of fonts.
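For readers who want to see what such a message looks like to a program, here is a minimal sketch that builds a plain text UTF-8 message with Python's standard email package and reads it back. It illustrates the general idea only, not what Thunderbird does internally; the addresses and the sample text are made up.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "sender@example.org"       # hypothetical addresses
msg["To"] = "recipient@example.org"
msg["Subject"] = "Test"
msg.set_content("Rhône, moiré, Wałęsa", charset="utf-8")

content_type = msg["Content-Type"]       # the header that declares the encoding
wire_bytes = msg.as_bytes()              # the message as it would be transmitted

# The receiving side: the header tells the program how to decode the body.
received = message_from_bytes(wire_bytes, policy=policy.default)
received_charset = received.get_content_charset()
received_text = received.get_content()
```

Nothing in the message says anything about fonts; the receiving program gets the characters and the name of the encoding, nothing more.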

Figure 1-17. Sending Unicode email in Thunderbird


Outlook Express (OE) may automatically convert the message into HTML format, if your text contains characters outside the repertoire that OE normally supports. To prevent this, go to the settings for outgoing mail in the Tools menu, and check that the plain text format is selected. Note that there are separate settings for outgoing email and for outgoing newsgroup (Usenet) messages. Check also the options for text format in the settings: make sure "MIME" is checked, and select "no encoding" instead of an encoding like Quoted Printable or Base64. When you then send email with special characters, OE asks how to send the message; select sending as Unicode. As you see, OE is less convenient for getting started with Unicode email than Thunderbird but once you've found the right settings, OE works well for Unicode.

1.8.5. Viewing Web Pages in Different Encodings

A web page author can specify the character encoding of her page in several ways, discussed in Chapter 10. Normally, your web browser recognizes the encoding and uses it to interpret and display the page. However, sometimes an author fails to use any of those ways or specifies a wrong encoding. Then the user may need to select the encoding, perhaps with trial and error, until the page becomes legible.

Your browser might not be prepared to handle all encodings. For Internet Explorer in particular, there is a set of updates available from the Windows Update site http://windowsupdate.microsoft.com. In addition to updates that fix security problems, the site contains optional updates that add some encodings to the capabilities of Internet Explorer (IE). The site http://www.mauvecloud.net/charsets/ contains pages for testing browser support for encodings.

If you visit web pages in many languages, you will probably encounter some pages that are not displayed correctly due to encoding mismatches. For example, you might visit a Hungarian page and see most characters correctly but some letters all wrong. If you have trouble finding such pages to experiment with, try the index http://www.dmoz.org/World/, which contains links to collections of web pages in different languages; the link names are in the language itself.

The explanation is probably that the web server has not sent any information about the encoding. In Hungary, web browsers are probably configured to use ISO-8859-2 in such cases, and users do not observe any problem. However, your browser might use ISO-8859-1 by default, and this makes a difference for a handful of characters. For example, the octet that denotes ű (u with double acute accent) according to ISO-8859-2 will be treated as û (u with circumflex) according to ISO-8859-1.
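Which letters are affected can be worked out mechanically. The following sketch encodes the accented lowercase letters of Hungarian in ISO-8859-2 and decodes each octet as ISO-8859-1; the list of letters is an assumption made for the example.

```python
hungarian_letters = "áéíóöőúüű"

misread = {}
for letter in hungarian_letters:
    octet = letter.encode("iso-8859-2")  # what the Hungarian page contains
    shown = octet.decode("iso-8859-1")   # what a browser defaulting to ISO-8859-1 shows
    if shown != letter:
        misread[letter] = shown

print(misread)
```

Only the two letters with a double acute accent come out wrong, which matches the observation that most of the page looks fine.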

What you can do as a user is to tell your browser to use an encoding that differs from the browser default:

  • On IE, select View > Encoding, and then the appropriate encoding. Use the "More" option when needed. In the example, you would select "Central European (ISO)," which is what Microsoft calls ISO-8859-2.

  • On Firefox, select View > Character Encoding, and choose the suitable encoding directly or via "More Encodings," as illustrated in Figure 1-18. Firefox classifies, for example, ISO-8859-2 as "East European" and calls it "Central European (ISO-8859-2)."

Often you can fix the display of a web page rather easily, since you can guess the encoding. This requires some experience, though. For example, Hungarian pages are most probably in ISO-8859-2 or in windows-1250 encoding; but for Russian pages, there are a few encodings (in the Cyrillic group) you might need to try.

1.8.6. Common Confusion: Encoding Versus Language

Quite often the choice of a character repertoire, code, or encoding is presented as the choice of a language. Programs typically confuse their users quite a lot in this area.

On the Opera browser (available from http://www.opera.com), for example, the keyboard shortcut Alt-P (or the command Tools > Preferences) and the choice of the "General" pane takes you to settings titled "Languages," as shown in Figure 1-19. The pane contains settings for three quite independent things:

  • The user interface language, i.e., the names of menus and options in the browser itself, quite independently of any particular page content.

  • The language preferences sent by the browser. You can specify an ordered sequence of languages, to be used in the (rare) cases where a web page is served in different language versions using a particular protocol.

  • The default encoding, to be used when a web page fails to specify its encoding in any explicit manner. The encoding windows-1252 is suitable here if you mainly view pages in English and other Western European languages. However, the encoding itself is a technical setting and does not depend on any language settings.

Figure 1-18. Setting encoding for a page in the Firefox browser


All these settings are useful, but lumping them together into one pane called "Languages" is misleading.

A language setting is quite distinct from character issues, although naturally each language has its own requirements for character repertoire. Even more seriously, programs and their documentation very often confuse the above-mentioned issues with the selection of a font.



Unicode Explained
ISBN: 059610121X
Year: 2006
Pages: 139
