4.1. Background and Terminology

In the "bad old days" of computing, roughly contemporaneous with the use of punched cards, there was a proliferation of character sets. Fortunately, those days are largely forgotten after the emergence of ASCII in the 1970s.

ASCII stands for American Standard Code for Information Interchange. It was a big step forward, but the operative word here is American. It was never designed to handle even European languages, much less Asian ones.

But there were loopholes. This character set had 128 characters (being a 7-bit code). But an 8-bit byte was standard; how could we waste that extra bit? The natural idea is to make a superset of ASCII, using the codes 128 through 255 for other purposes. The trouble is, this was done many times in many different ways by IBM and others. There was no widespread agreement on what, for example, character 221 was.
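
Here is a quick sketch of that ambiguity (assuming a Ruby with built-in encoding support, 1.9 or later): the very same byte 221 is a different character depending on which 8-bit character set we claim it belongs to.

byte = 221.chr                                        # the single byte 0xDD
byte.force_encoding("ISO-8859-1").encode("UTF-8")     # => "Ý"  (Latin-1)
byte.force_encoding("Windows-1251").encode("UTF-8")   # => "Э"  (Cyrillic)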

The shortcomings of such an approach are obvious. Not only do the sender and receiver have to agree on the exact character set, but they are limited in what languages they can use all at once. If you wanted to write in German but quote a couple of sources in Greek and Hebrew, you probably couldn't do it at all. And this scheme didn't begin to address the problems of Asian languages such as Chinese, Japanese, and Korean.

There were two basic kinds of solutions to this problem. One was to use a much larger character set, one with 16 bits, for example (so-called wide characters). The other was to use variable-length multibyte encodings. In such a scheme, some characters might be represented in a single byte, some in two bytes, and some in three or even more. Obviously this raised many issues: For one, a string had to be uniquely decodable. The first byte of a multibyte character could be in a special class so that we could know to expect another byte, but what about the second and later bytes? Are they allowed to overlap with the set of single-byte characters? Are certain characters allowed as second or third bytes, or are they disallowed? Will we be able to jump into the middle of a string and still make sense of it? Will we be able to iterate backwards over a string if we want? Different encodings made different design decisions.
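
As a small illustration of the decoding problem (a sketch assuming Ruby 1.9.3 or later for byteslice), slicing a variable-length string at an arbitrary byte position can land in the middle of a character:

s = "résumé"                   # UTF-8: 6 characters, 8 bytes
s.length                       # => 6
s.bytesize                     # => 8
fragment = s.byteslice(0, 2)   # cuts the two-byte "é" in half
fragment.valid_encoding?       # => false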

Eventually the idea for Unicode was born. Think of it as a "world character set." Unfortunately, nothing is ever that simple.

You may have heard it said that Unicode was (or is) limited to 65,536 characters (the number that can be stored in 16 bits). This is a common misconception. Unicode was never designed with that kind of constraint; it was understood from the beginning that in many usages, it would be a multibyte scheme. The number of characters that can be represented is essentially limitless, which is a good thing, because 65,000 would never suffice to handle all the languages of the world.

One of the first things to understand about I18N is that the interpretation of a string is not intrinsic to the string itself. That kind of old-fashioned thinking comes from the notion that there is only one way of storing strings.

I can't stress this enough. Internally, a string is just a series of bytes. To emphasize this, let's imagine a single ASCII character stored in a byte of memory. If we store the letter that we call "capital A," we really are storing the number 65.

Why do we view a 65 as an A? It's because of how the data item is used (or how it is interpreted). If we take that item and add it to another number, we are using it (interpreting it) as a number; if we send it to an ASCII terminal over a serial line, we are interpreting it as an ASCII character.
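
In Ruby terms (a minimal sketch using the standard ord and chr methods), the same 65 can be viewed either way:

"A".ord         # => 65    (the character viewed as a number)
65.chr          # => "A"   (the number viewed as a character)
65 + 1          # => 66    (used arithmetically)
(65 + 1).chr    # => "B"   (and back again)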

Just as a single byte can be interpreted in more than one way, so obviously can a whole sequence of bytes. In fact, the intended interpretation scheme (or encoding) has to be known in advance for a string to make any real sense. An encoding is simply a mapping between binary numbers and characters. And yet it still isn't quite this simple.

Because Ruby originated in Japan, it handles two different Japanese encodings (as well as ASCII) very well. I won't spend much time on Japanese; if you are a Japanese reader, you have access to a wide variety of Ruby books in that language. For the rest of us, Unicode is the most widely usable encoding. This chapter focuses on Unicode.

But before we get too deeply into these issues, let's look at some terminology. Calling things by useful names is one of the foundations of wisdom.

  • A byte is simply eight bits (though in the old days, even this was not true). Traditionally many of us think of a byte as corresponding to a single character. Obviously we can't think that way in an I18N context.

  • A codepoint is simply a single entry in the imaginary table that represents the character set. As a half-truth, you may think of a codepoint as mapping one-to-one to a character. Nearer to the truth, it sometimes takes more than a single codepoint to uniquely specify a character.

  • A glyph is the visual representation of a codepoint. It may seem a little unintuitive, but a character's identity is distinct from its visual representation. (I may open my word processor and type a capital A in a dozen different fonts, but I still name each of them A.)

  • A grapheme is similar in concept to a glyph, but when we talk about graphemes, we are coming from the context of the language, not the context of our software. A grapheme may be the combination (naive or otherwise) of two or more glyphs. It is the way a user thinks about a character in his own native language context. The distinction is subtle enough that many programmers will simply never worry about it.

What then is a character? Even in the Unicode world, there is some fuzziness associated with this concept because different languages behave a little differently and programmers think differently from other people. Let's say that a character is an abstraction of a writing symbol that can be visually represented in one or more ways.
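
Ruby can make the codepoint/grapheme distinction visible. Here is a sketch (assuming Ruby 2.5 or later for grapheme_clusters):

s = "n\u0303"                 # LATIN SMALL LETTER N plus COMBINING TILDE
s.codepoints.length           # => 2   two codepoints...
s.grapheme_clusters.length    # => 1   ...but one user-perceived character (ñ)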

Let's get a little more concrete. First, let me introduce a notation to you. We habitually represent Unicode codepoints with the notation U+ followed by four or more uppercase hexadecimal digits. So what we call the letter "A" can be specified as U+0041.
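
In Ruby (a brief sketch, assuming 1.9 or later for the \u escape), we can move between the two representations:

"\u0041"                                  # => "A"
"A".codepoints.map { |c| "U+%04X" % c }   # => ["U+0041"]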

Now take the letter "é" for example (lowercase e with an acute accent). This can actually be represented in two ways in Unicode. The first way is the single codepoint U+00E9 (LATIN SMALL LETTER E WITH ACUTE). The second way is two codepoints, a small e followed by an acute accent: U+0065 and U+0301 (or LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT).

Both forms are equally valid. The shorter one is referred to as the precomposed form. Bear in mind, though, that not every language has precomposed variants, so it isn't always possible to reduce such a character to a single codepoint.
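
Here is a sketch of the two forms side by side (unicode_normalize requires Ruby 2.2 or later):

precomposed = "\u00E9"                              # é as a single codepoint
decomposed  = "\u0065\u0301"                        # e plus combining acute accent
precomposed == decomposed                           # => false (different codepoint sequences)
precomposed.length                                  # => 1
decomposed.length                                   # => 2
precomposed.unicode_normalize(:nfd) == decomposed   # => true after normalization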

I've referred to Unicode as an encoding, but that isn't strictly correct. Unicode maps characters to codepoints; there are different ways to map codepoints to binary storage. In effect, Unicode is a family of encodings.

Let's take the string "Matz" as an example. This consists of four Unicode codepoints:

"Matz"    # U+004d U+0061 U+0074 U+007a


The straightforward way to store this would be as a simple sequence of bytes.

00 4d 00 61 00 74 00 7a


This is called UCS-2 (as in two bytes) or UTF-16 (as in 16 bits); strictly speaking, UTF-16 extends UCS-2 with surrogate pairs so that codepoints beyond U+FFFF can also be represented. Note that this encoding comes in two "flavors," a big-endian and a little-endian form.
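
A sketch of the two byte orders (assuming Ruby 1.9 or later with its built-in UTF-16 converters):

"Matz".encode("UTF-16BE").bytes.map { |b| "%02x" % b }
# => ["00", "4d", "00", "61", "00", "74", "00", "7a"]
"Matz".encode("UTF-16LE").bytes.map { |b| "%02x" % b }
# => ["4d", "00", "61", "00", "74", "00", "7a", "00"]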

However, notice that every other byte is zero. This isn't mere coincidence; it is typical for English text, which rarely goes beyond codepoint U+00FF. It's somewhat wasteful of memory.

This brings us to the idea of UTF-8. This is a Unicode encoding where the "traditional" characters are represented as single bytes, but others may be multiple bytes. Here is a UTF-8 encoding of this same string:

4d 61 74 7a


Notice that all we have done is strip off the zeroes; more importantly, note that this is the same as ordinary ASCII. This is obviously by design; "plain ASCII" can be thought of as a proper subset of UTF-8.

One implication of this is that when UTF-8 text is interpreted as ASCII, it sometimes appears "normal" (especially if the text is mostly English). Sometimes you may find that in a browser or other application English text is displayed correctly, but there are additional "garbage" characters. In such a case, it's likely that the application is making the wrong assumption about what encoding is being used.
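
This kind of mistake is easy to simulate (a sketch assuming Ruby 1.9 or later): take UTF-8 bytes, tell Ruby they are Latin-1, and the accented characters turn into the familiar garbage while the ASCII portion still looks fine.

s = "résumé"                                      # UTF-8
s.force_encoding("ISO-8859-1").encode("UTF-8")    # => "rÃ©sumÃ©"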

So we can argue that UTF-8 saves memory. Of course, I'm speaking from an Anglocentric point of view again (or at least ASCII-centric). When the text is primarily ASCII, memory will be conserved, but for other writing systems such as Greek or Cyrillic, the strings will actually grow in size.
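
A quick sketch of both effects (standard String methods, Ruby 1.9 or later):

"Matz".length        # => 4
"Matz".bytesize      # => 4   one byte per character (ASCII range)
"αβγδ".length        # => 4
"αβγδ".bytesize      # => 8   two bytes per character (Greek)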

Another obvious benefit is that UTF-8 is "backward compatible" with ASCII, still arguably the most common single-byte encoding in the world. Finally, UTF-8 also has some special features to make it convenient for programmers.

For one thing, the bytes used in multibyte characters are assigned carefully. The null character (ASCII 0) is never used as the nth byte in a sequence (where n > 1), nor are such common characters as the slash (commonly used as a pathname delimiter). As a matter of fact, no byte in the full ASCII range (0x00-0x7F) can be used as part of any other character.

The first byte in a multibyte character uniquely determines how many bytes will follow. It is always in the range 0xC0 to 0xFD, and any following bytes are always in the range 0x80 to 0xBF. This ensures that the encoding scheme is stateless (self-synchronizing) and allows recovery after missing or garbled bytes.
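
The bit patterns are easy to inspect (a sketch, Ruby 1.9 or later): a two-byte character starts with 110xxxxx, a three-byte character with 1110xxxx, and every continuation byte with 10xxxxxx.

"é".bytes.map { |b| "%08b" % b }   # => ["11000011", "10101001"]
"€".bytes.map { |b| "%08b" % b }   # => ["11100010", "10000010", "10101100"]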

UTF-8 is one of the most flexible and common encodings in the world. It has been in use since the early 1990s and is the default encoding for XML. Most of our attention in this chapter will be focused on UTF-8.



