Recipe1.21.Converting Between Unicode and Plain Strings

Recipe 1.21. Converting Between Unicode and Plain Strings

Credit: David Ascher, Paul Prescod

Problem

You need to deal with textual data that doesn't necessarily fit in the ASCII character set.

Solution

Unicode strings can be encoded in plain strings in a variety of ways, according to whichever encoding you choose:

unicodestring = u"Hello world" # Convert Unicode to plain Python string: "encode" utf8string = unicodestring.encode("utf-8") asciistring = unicodestring.encode("ascii") isostring = unicodestring.encode("ISO-8859-1") utf16string = unicodestring.encode("utf-16") # Convert plain Python string to Unicode: "decode" plainstring1 = unicode(utf8string, "utf-8") plainstring2 = unicode(asciistring, "ascii") plainstring3 = unicode(isostring, "ISO-8859-1") plainstring4 = unicode(utf16string, "utf-16") assert plainstring1 == plainstring2 == plainstring3 == plainstring4

Discussion

If you find yourself dealing with text that contains non-ASCII characters, you have to learn about Unicodewhat it is, how it works, and how Python uses it. The preceding Recipe 1.20 offers minimal but crucial practical tips, and this recipe tries to offer more perspective.

You don't need to know everything about Unicode to be able to solve real-world problems with it, but a few basic tidbits of knowledge are indispensable. First, you must understand the difference between bytes and characters. In older, ASCII-centric languages and environments, bytes and characters are treated as if they were the same thing. A byte can hold up to 256 different values, so these environments are limited to dealing with no more than 256 distinct characters. Unicode, on the other hand, has tens of thousands of characters, which means that each Unicode character takes more than one byte; thus you need to make the distinction between characters and bytes.

Standard Python strings are really bytestrings, and a Python character, being such a string of length 1, is really a byte. Other terms for an instance of the standard Python string type are 8-bit string and plain string. In this recipe we call such instances bytestrings, to remind you of their byte orientation.

A Python Unicode character is an abstract object big enough to hold any character, analogous to Python's long integers. You don't have to worry about the internal representation; the representation of Unicode characters becomes an issue only when you are trying to send them to some byte-oriented function, such as the write method of files or the send method of network sockets. At that point, you must choose how to represent the characters as bytes. Converting from Unicode to a bytestring is called encoding the string. Similarly, when you load Unicode strings from a file, socket, or other byte-oriented object, you need to decode the strings from bytes to characters.

Converting Unicode objects to bytestrings can be achieved in many ways, each of which is called an encoding. For a variety of historical, political, and technical reasons, there is no one "right" encoding. Every encoding has a case-insensitive name, and that name is passed to the encode and decode methods as a parameter. Here are a few encodings you should know about:

The UTF-8 encoding can handle any Unicode character. It is also backwards compatible with ASCII, so that a pure ASCII file can also be considered a UTF-8 file, and a UTF-8 file that happens to use only ASCII characters is identical to an ASCII file with the same characters. This property makes UTF-8 very backwards-compatible, especially with older Unix tools. UTF-8 is by far the dominant encoding on Unix, as well as the default encoding for XML documents. UTF-8's primary weakness is that it is fairly inefficient for eastern-language texts.
The UTF-16 encoding is favored by Microsoft operating systems and the Java environment. It is less efficient for western languages but more efficient for eastern ones. A variant of UTF-16 is sometimes known as UCS-2.
The ISO-8859 series of encodings are supersets of ASCII, each able to deal with 256 distinct characters. These encodings cannot support all of the Unicode characters; they support only some particular language or family of languages. ISO-8859-1, also known as "Latin-1", covers most western European and African languages, but not Arabic. ISO-8859-2, also known as "Latin-2", covers many eastern European languages such as Hungarian and Polish. ISO-8859-15, very popular in Europe these days, is basically the same as ISO-8859-1 with the addition of the Euro currency symbol as a character.

If you want to be able to encode all Unicode characters, you'll probably want to use UTF-8. You will need to deal with the other encodings only when you are handed data in those encodings created by some other application or input device, or vice versa, when you need to prepare data in a specified encoding to accommodate another application downstream of yours, or an output device. In particular, Recipe 1.22 shows how to handle the case in which the downstream application or device is driven from your program's standard output stream.

Recipe1.21.Converting Between Unicode and Plain Strings