
Recipe 1.20. Handling International Text with Unicode

Credit: Holger Krekel

Problem

You need to deal with text strings that include non-ASCII characters.

Solution

Python has a first-class unicode type that you can use in place of the plain bytestring str type. It's easy, once you accept the need to convert explicitly between a bytestring and a Unicode string:

>>> german_ae = unicode('\xc3\xa4', 'utf8')

Here german_ae is a unicode string representing the German lowercase a with umlaut (i.e., diaeresis) character "ä". It has been constructed by interpreting the bytestring '\xc3\xa4' according to the specified UTF-8 encoding. There are many encodings, but UTF-8 is often used because it is universal (UTF-8 can encode any Unicode string) and yet fully compatible with the 7-bit ASCII set (any ASCII bytestring is a correct UTF-8-encoded string).
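The two properties of UTF-8 mentioned above can be checked directly. The following is a minimal sketch, not part of the original recipe; it uses the decode method with b'' and u'' literals, so it should run unchanged on Python 2.6+ (where the text type is unicode) and on Python 3.3+ (where it is str). The variable names are illustrative only.

```python
# ASCII compatibility: an ASCII bytestring decodes to the same text
# whether it is interpreted as ASCII or as UTF-8.
ascii_bytes = b'hello'
same_text = ascii_bytes.decode('ascii') == ascii_bytes.decode('utf-8')

# Universality comes at the price of variable width: the two bytes
# '\xc3\xa4' decode to a single character, a with umlaut (U+00E4).
utf8_bytes = b'\xc3\xa4'
german_ae = utf8_bytes.decode('utf-8')
byte_count = len(utf8_bytes)    # counts bytes
char_count = len(german_ae)     # counts characters
```

Here byte_count and char_count differ, which is one more reason to keep bytestrings and Unicode strings apart in your head.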

Once you cross this barrier, life is easy! You can manipulate this Unicode string in practically the same way as a plain str string:

>>> sentence = "This is a " + german_ae
>>> sentence2 = "Easy!"
>>> para = ". ".join([sentence, sentence2])

Note that para is a Unicode string, because operations between a unicode string and a bytestring always result in a unicode string, unless they fail and raise an exception:

>>> bytestring = '\xc3\xa4'     # Uuh, some non-ASCII bytestring!
>>> german_ae += bytestring
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

The byte '0xc3' is not a valid character in the 7-bit ASCII encoding, and Python refuses to guess an encoding. So, being explicit about encodings is the crucial point for successfully using Unicode strings with Python.
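One way to avoid this failure is to decode the bytestring yourself, naming its encoding, before combining it with a Unicode string. The sketch below assumes the bytes are UTF-8, which you would have to know from context. It is written with b'' literals and the decode method so that it runs on both Python 2.6+ and Python 3; the flag ascii_failed is only there to make the failure of the ASCII guess visible.

```python
german_ae = b'\xc3\xa4'.decode('utf-8')
bytestring = b'\xc3\xa4'        # some non-ASCII bytestring

# Decode explicitly, then combine text with text.
german_ae += bytestring.decode('utf-8')

# For comparison, decoding the same bytes as ASCII (what the implicit
# conversion attempted) raises UnicodeDecodeError.
try:
    bytestring.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```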

Discussion

Unicode is easy to handle in Python, if you respect a few guidelines and learn to deal with common problems. This is not to say that an efficient implementation of Unicode is an easy task. Luckily, as with other hard problems, you don't have to care much: you can just use the efficient implementation of Unicode that Python provides.

The most important issue is to fully accept the distinction between a bytestring and a unicode string. As exemplified in this recipe's solution, you often need to explicitly construct a unicode string by providing a bytestring and an encoding. Without an encoding, a bytestring is basically meaningless, unless you happen to be lucky and can just assume that the bytestring is text in ASCII.

The most common problem with using Unicode in Python arises when you are doing some text manipulation where only some of your strings are unicode objects and others are bytestrings. Python makes a shallow attempt to implicitly convert your bytestrings to Unicode. It usually assumes an ASCII encoding, though, which gives you UnicodeDecodeError exceptions if you actually have non-ASCII bytes somewhere. UnicodeDecodeError tells you that you mixed Unicode and bytestrings in such a way that Python cannot (doesn't even try to) guess the text your bytestring might represent.

Developers from many big Python projects have come up with simple rules of thumb to prevent such runtime UnicodeDecodeErrors, and the rules may be summarized into one sentence: always do the conversion at IO barriers. To express this same concept a bit more extensively:

Whenever your program receives text data "from the outside" (from the network, from a file, from user input, etc.), construct unicode objects immediately. Find out the appropriate encoding, for example, from an HTTP header, or look for an appropriate convention to determine the encoding to use.
Whenever your program sends text data "to the outside" (to the network, to some file, to the user, etc.), determine the correct encoding, and convert your text to a bytestring with that encoding. (Otherwise, Python attempts to convert Unicode to an ASCII bytestring, likely producing UnicodeEncodeErrors, which are just the converse of the UnicodeDecodeErrors previously mentioned.)
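The two rules can be sketched as a pair of small helpers. The functions read_text and write_text are hypothetical names introduced for illustration, and io.BytesIO stands in for a real file or socket; the encodings ('utf-8' on input, 'latin1' on output) are assumptions that a real program would take from a header, a configuration setting, or a convention. The code runs on Python 2.6+ and Python 3.

```python
import io

def read_text(stream, encoding):
    """Decode bytes into text as soon as they enter the program."""
    return stream.read().decode(encoding)

def write_text(stream, text, encoding):
    """Encode text into bytes only at the moment it leaves the program."""
    stream.write(text.encode(encoding))

# Bytes arriving "from the outside", here UTF-8 for a German greeting.
incoming = io.BytesIO(b'Gr\xc3\xbc\xc3\x9fe')
text = read_text(incoming, 'utf-8')

# Inside the program, work with Unicode strings only.
reply = u'Re: ' + text

# Bytes going "to the outside", here for a consumer that expects Latin-1.
outgoing = io.BytesIO()
write_text(outgoing, reply, 'latin1')
```

Keeping the conversions in two places like this means that everything between them can assume it is handling text, never bytes of unknown encoding.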

With these two rules, you will solve most Unicode problems. If you still get UnicodeErrors of either kind, look for where you forgot to properly construct a unicode object, forgot to properly convert back to an encoded bytestring, or ended up using an inappropriate encoding due to some mistake. (It is quite possible that such encoding mistakes are due to the user, or some other program that is interacting with yours, not following the proper encoding rules or conventions.)
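An inappropriate encoding does not always announce itself with an exception. The following sketch, which is an illustration rather than part of the original recipe, decodes the same UTF-8 bytes once correctly and once as Latin-1. Because Latin-1 assigns a character to every possible byte, the wrong choice silently yields two wrong characters, while a stricter codec such as UTF-8 rejects bytes that do not fit. It runs on Python 2.6+ and Python 3.

```python
utf8_bytes = b'\xc3\xa4'

right = utf8_bytes.decode('utf-8')    # one character: a with umlaut
wrong = utf8_bytes.decode('latin1')   # two characters, and no exception

# The reverse mistake is caught: the lone Latin-1 byte '\xe4' is not
# valid UTF-8, so decoding it as UTF-8 raises UnicodeDecodeError.
try:
    b'\xe4'.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

If your output shows pairs of odd characters where you expected a single accented one, a mismatch of this kind is a likely cause.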

In order to convert a Unicode string back to an encoded bytestring, you usually do something like:

>>> bytestring = german_ae.encode('latin1')
>>> bytestring
'\xe4'

Now bytestring is a German ä character in the 'latin1' encoding. Note how '\xe4' (in Latin1) and the previously shown '\xc3\xa4' (in UTF-8) represent the same German character, but in different encodings.
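A short sketch can confirm that the two bytestrings name the same character, and show what happens when the target encoding has no byte for it. This example is an addition for illustration; it uses only the encode and decode methods and runs on Python 2.6+ and Python 3.

```python
german_ae = b'\xc3\xa4'.decode('utf-8')

as_latin1 = german_ae.encode('latin1')   # one byte
as_utf8 = german_ae.encode('utf-8')      # two bytes

# Different bytes, same text once each is decoded with its own encoding.
round_trip = (as_latin1.decode('latin1') == german_ae and
              as_utf8.decode('utf-8') == german_ae)

# ASCII has no representation for this character, so encoding fails.
try:
    german_ae.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```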

By now, you can probably imagine why Python refuses to guess among the hundreds of possible encodings. It's a crucial design choice, based on one of the Zen of Python principles: "In the face of ambiguity, resist the temptation to guess." At any interactive Python shell prompt, enter the statement import this to read all of the important principles that make up the Zen of Python.
