4.2. Coding in a Post-ASCII World

The "age of ASCII" is gone, though some have not realized it yet. Many assumptions commonly made by programmers in the past are no longer true. We need a new way of thinking.

There are two concepts that I consider to be fundamental, almost axiomatic. First, a string has no intrinsic interpretation. It must be interpreted according to some external standard. Second, a byte does not correspond to a character; a character may be one or more bytes. There are other lessons to be learned, but these two come first.

These facts may sometimes affect our programming in subtle ways. Let's examine in detail how to handle character strings in a modern fashion.

4.2.1. The `jcode` Library and `$KCODE`

To use different character sets in Ruby, you must first be aware of the global variable $KCODE, which determines the behavior of many core methods that manipulate strings. (The K, by the way, comes from kanji, which are the Japanese nonalphabetic writing symbols.) There are five usual settings for this variable; each is a single case-insensitive letter ("ASCII" and "NONE" are the same).

a    ASCII
n    NONE (ASCII)
e    EUC
s    SJIS
u    UTF-8

Actually, you can use a full description for clarity if you want (for example, $KCODE = "UTF-8"). Only the first character of the string is significant.

ASCII we already know about. EUC and Shift-JIS (SJIS) are of minimal interest to us here. We'll concentrate on the "utf-8" setting.

After you set $KCODE, you get a lot of functionality for free. For example, the inspect method (called automatically when you invoke the p method to print an object in readable form) will typically honor the $KCODE setting.

$KCODE = "n"
# In case you didn't know, the French word "épée"
# refers to a kind of sword.
eacute = ""
eacute << 0303 << 0251               # U+00E9
sword = eacute + "p" + eacute + "e"
p eacute                             # "\303\251"
p sword                              # "\303\251p\303\251e"
$KCODE = "u"
p eacute                             # "é"
p sword                              # "épée"

Regular expressions also become a little smarter in UTF-8 mode.

$KCODE = "n"
letters = sword.scan(/(.)/)
# [["\303"], ["\251"], ["p"], ["\303"], ["\251"], ["e"]]
puts letters.size                    # 6
$KCODE = "u"
letters = sword.scan(/(.)/)
# [["é"], ["p"], ["é"], ["e"]]
puts letters.size                    # 4

The jcode library also provides some useful methods such as jlength and each_char. It's not a bad idea to require this library anytime you use UTF-8.

In the next section, we'll revisit common operations with strings and regular expressions. We'll learn more about jcode there.

4.2.2. Revisiting Common String and Regex Operations

When using UTF-8, some operations work exactly as before. Concatenation of strings is unchanged:

"ép" + "ée"    # "épée"
"ép" << "ée"   # "épée"

Because UTF-8 is stateless, checking for the presence of a substring requires no special considerations either:

"épée".include?("é")    # true

However, some common assumptions require rethinking when we internationalize. Obviously a character is no longer always a byte. When we count characters or bytes, we have to consider what we really want to count and why. The same is true for iteration.

There is a convention that a codepoint is sometimes thought of as a "programmer's character." This is another half-truth, but one that is sometimes useful.

The jlength method will return the actual number of codepoints in a string rather than bytes. If you actually want bytes, you can still call the length method.

$KCODE = "u"
require 'jcode'
sword = "épée"
sword.jlength      # 4
sword.length       # 6

Methods such as upcase and capitalize will typically fail with special characters. This is a limitation in current Ruby. (It is not really appropriate to view this as a bug because capitalization in general is a complex issue and simply isn't handled in internationalized Ruby. Consider it a set of missing features.)

$KCODE = "u"
sword.upcase       # "éPéE"
sword.capitalize   # "épée"

If you are using a decomposed form, this might appear to work in some cases, as the Latin letters are separated from the diacritics. But in the general case it won't work; it will fail for Turkish, German, Dutch, and any other language where the capitalization rules are nontrivial.

You might think that accented characters would be treated as equivalent in some sense to their unaccented counterparts. In general, this is not true. They're simply different characters. Here's an example with count:

$KCODE = "u"
sword.count("e")   # 1 (not 3)

Again, the opposite is true for decomposed characters. The Latin letter is detected in that case.
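To see the difference concretely, here is a small sketch that builds both forms of the word with pack (so that nothing depends on how the source file is encoded) and counts the letter "e" in each. The variable names are only illustrative.

```ruby
# Precomposed: U+00E9 is a single codepoint for the accented letter.
precomposed = [0xE9, 0x70, 0xE9, 0x65].pack('U*')

# Decomposed: "e" (U+0065) followed by COMBINING ACUTE ACCENT (U+0301).
decomposed = [0x65, 0x301, 0x70, 0x65, 0x301, 0x65].pack('U*')

puts precomposed.count("e")   # 1
puts decomposed.count("e")    # 3
```

In the decomposed string, every "e" is an ordinary ASCII letter followed by a separate accent codepoint, which is why count finds all three.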

Similarly, count will return a misleading result when passed a multibyte character. The jcount method will handle the latter case, however.

$KCODE = "u"
sword.count("eé")   # 5 (not 3)
sword.jcount("eé")  # 3

There is a convenience method mbchar?, which detects whether a string has any multibyte characters in it.

$KCODE = "u"
sword.mbchar?    # 0  (offset of first multibyte char)
"foo".mbchar?    # nil

The jcode library also redefines such methods as chop, delete, squeeze, succ, tr, and tr_s. Anytime you use these in UTF-8 mode, be aware you are using the "multibyte-aware" version. If you handle multibyte strings without the jcode library, you may get surprising or erroneous results.

We can iterate over a string by bytes as usual; or we can iterate by characters using each_char. The latter method deals with single-character strings; the former (in current versions of Ruby) deals with single-byte integers.

Of course, we're once again equating a codepoint with a character. Despite the name, each_char actually iterates over codepoints, strictly speaking, not characters.

$KCODE = "u"
sword.each_byte {|x| puts x }   # Six lines of integers
sword.each_char {|x| puts x }   # Four lines of strings

If you're confused, don't feel bad. Most of us are. I've attempted to summarize the situation in Table 4.1.

Table 4.1. Precomposed and Decomposed Forms
Precomposed Form of "é"
Character name	Glyph	Codepoint	UTF-8 Bytes	Comments
LATIN SMALL LETTER E WITH ACUTE	é	U+00E9	0xC3 0xA9	One character, one codepoint, two UTF-8 bytes
Decomposed Form of "é"
Character name	Glyph	Codepoint	UTF-8 Bytes	Comments
LATIN SMALL LETTER E	e	U+0065	0x65	One character, two codepoints (two "programmer's characters"), three UTF-8 bytes
COMBINING ACUTE ACCENT	´	U+0301	0xCC 0x81
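The figures in Table 4.1 can be checked with a short sketch that uses only pack and unpack. The U directive works in codepoints and the C directive in raw bytes; the variable names are only illustrative.

```ruby
precomposed = [0xE9].pack('U')          # LATIN SMALL LETTER E WITH ACUTE
decomposed  = [0x65, 0x301].pack('U*')  # "e" + COMBINING ACUTE ACCENT

[precomposed, decomposed].each do |str|
  codepoints = str.unpack('U*')
  bytes      = str.unpack('C*')
  puts "codepoints: " + codepoints.map {|cp| 'U+%04X' % cp }.join(' ')
  puts "bytes:      " + bytes.map {|b| '0x%02X' % b }.join(' ')
end
# codepoints: U+00E9
# bytes:      0xC3 0xA9
# codepoints: U+0065 U+0301
# bytes:      0x65 0xCC 0x81
```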

What else do we need to consider with internationalized strings? Obviously the "bracket" notation still refers to bytes, not characters. But you could change this if you wanted. Here is one implementation (not especially efficient, but easy to understand):

class String
  def [](index)
    self.scan(/./)[index]
  end
  def []=(index,value)
    arr = self.scan(/./)
    arr[index] = value
    self.replace(arr.join)
    value
  end
end

Of course, this omits much of the functionality of the real [] method, which can understand ranges, regular expressions, and so on. If you really want this functionality, you will have to do some more coding.
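If redefining String#[] for the whole program feels too invasive, the same idea can be packaged as an ordinary method. The following is only a sketch; char_at is a hypothetical name, and the u modifier is added so that the result does not depend on how $KCODE happens to be set.

```ruby
# Hypothetical helper: returns the one-codepoint string at a given
# character index, leaving String#[] itself untouched.
def char_at(str, index)
  str.scan(/./mu)[index]
end

sword = [0xE9, 0x70, 0xE9, 0x65].pack('U*')   # "épée"
p char_at(sword, 1)    # "p"
p char_at(sword, 4)    # nil (past the end)
```

Like the version above, this scans the whole string on every call, so it is easy to follow rather than fast.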

The unpack method has options that help us manipulate Unicode strings. By using U* as a directive in the template string, we can convert UTF-8 strings into arrays of codepoints (U on its own will convert only the first codepoint):

codepoints = sword.unpack('U*') # [233, 112, 233, 101]

Here is a slightly more useful example, which converts all non-ASCII codepoints (everything from U+0080 up) in a string into the U+XXXX notation we used earlier:

def reveal_non_ascii(str)
  str.unpack('U*').map do |cp|
    if cp < 0x80
      cp.chr
    else
      '(U+%04X)' % cp
    end
  end.join
end
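As a quick usage sketch, here is the method applied to both the precomposed and the decomposed spelling of our sword. The definition is repeated so that the snippet runs on its own, and the input strings are built with pack so that they don't depend on the encoding of the source file.

```ruby
def reveal_non_ascii(str)
  str.unpack('U*').map do |cp|
    if cp < 0x80
      cp.chr                 # plain ASCII passes through unchanged
    else
      '(U+%04X)' % cp        # everything else is spelled out
    end
  end.join
end

precomposed = [0xE9, 0x70, 0xE9, 0x65].pack('U*')
decomposed  = [0x65, 0x301, 0x70, 0x65, 0x301, 0x65].pack('U*')

puts reveal_non_ascii(precomposed)   # (U+00E9)p(U+00E9)e
puts reveal_non_ascii(decomposed)    # e(U+0301)pe(U+0301)e
```

This makes the two spellings easy to tell apart even when they look identical on screen.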

The String#unpack method has a cousin, Array#pack, which performs the inverse operation:

[233, 112, 233, 101].pack('U*') # "épée"

We can use this to allow us to insert Unicode characters that we can't easily type:

eacute = [0xE9].pack('U')
cafe = "caf#{eacute}"      # "café"

Regular expressions also are multibyte-aware, especially if you are using Oniguruma (which we looked at in Chapter 3, "Working with Regular Expressions"). For example, /./ will match a single multibyte character.

The u modifier will make a regex UTF-8-aware. If $KCODE is set to "u", this isn't necessary; but the redundancy doesn't hurt anything. (And such redundancy can be useful when the code is part of a larger context where we don't necessarily know how $KCODE is set.)

Even without Oniguruma, regexes are smart enough to recognize multibyte characters as being "word" characters or not.

$KCODE = "u"
sword =~ /\w/    # 0
sword =~ /\W/    # nil

With Oniguruma, the backslash sequences (such as \w, \s) recognize a wider range of codepoints as being words, spaces, and so on.

Regular expressions let us perform some simple string actions in a safe manner. We can already truncate an ASCII string easily. The following code will return at most 20 characters from ascii_string:

ascii_string[0,20]

However, because a Unicode codepoint can span more than one byte, we can't safely use the same technique in a UTF-8 string. There's a risk that invalid byte sequences will be left on the end of the string. In addition, it's less useful because we can't tell in advance how many codepoints are going to result. Regular expressions come to our rescue:

def truncate(str, max_length)
  str[/.{0,#{max_length}}/m]
end
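Here is a brief usage sketch. The only change from the method above is the u modifier, which I have added as an assumption so that the pattern is UTF-8-aware regardless of how $KCODE is set.

```ruby
def truncate(str, max_length)
  # The u modifier makes "." match a whole UTF-8 character.
  str[/.{0,#{max_length}}/mu]
end

sword = [0xE9, 0x70, 0xE9, 0x65].pack('U*')   # "épée"
short = truncate(sword, 3)                    # "épé"

p short.unpack('U*')        # [233, 112, 233]
p short.unpack('C*').size   # 5
```

The result holds three codepoints but five bytes, and it ends on a character boundary rather than in the middle of a multibyte sequence.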

4.2.3. Detecting Character Encodings

Detecting which encoding is used by a given string is a complex problem. Multibyte encodings often have distinctive patterns that can be used to recognize them, but single-byte encodings (like most of the ones used in Western languages) are much more difficult. Statistical solutions can be used for detection, but they are outside the scope of this book (and they are not especially reliable solutions in general).

Fortunately, however, we usually want to do something simpler: to determine whether a string is UTF-8. This can be determined with good reliability. Here's a simple method (depending on the fact that unpack raises an exception on an invalid string):

class String
  def utf8?
    unpack('U*') rescue return false
    true
  end
end

We can detect pure ASCII just by verifying that every byte is less than 128:

class String
  def ascii?
    # Compare byte values, not one-character strings
    unpack('C*').all? {|byte| byte < 128 }
  end
end
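The two checks can be combined into a rough classifier. This is only a sketch: classify is a hypothetical name, and :unknown simply means "neither pure ASCII nor well-formed UTF-8" rather than any particular encoding.

```ruby
# Hypothetical helper combining the ASCII and UTF-8 checks.
def classify(str)
  return :ascii if str.unpack('C*').all? {|byte| byte < 128 }
  begin
    str.unpack('U*')      # raises ArgumentError on malformed UTF-8
    :utf8
  rescue ArgumentError
    :unknown
  end
end

p classify("sword")              # :ascii
p classify([0xE9].pack('U'))     # :utf8
p classify("hello\xfe")          # :unknown
```

The ASCII test comes first because every pure ASCII string is also valid UTF-8; checking in the other order would never report :ascii.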

4.2.4. Normalizing Unicode Strings

Up until now, we've been using precomposed characters, ones in which the base character and diacritic are combined into a single entity and a single Unicode codepoint. In general, though, Unicode separates the encoding of characters and their diacritics. Instead of storing "é" as a single LATIN SMALL LETTER E WITH ACUTE codepoint, we can store it in a decomposed form, as LATIN SMALL LETTER E plus COMBINING ACUTE ACCENT.

Why would we want to do this? It provides flexibility and allows us to apply diacritic marks to any character, not just the combinations considered by the encoding designer. In fact, fonts will include glyphs for common combinations of character and diacritic, but the display of an entity is separate from its encoding.

Unicode has numerous design considerations such as efficiency and round-trip compatibility with existing national encodings. Sometimes these constraints introduce some redundancy; for example, Unicode includes codepoints not only for decomposed forms but also for many of the precomposed forms already in use. This means that there is also a codepoint for LATIN SMALL LETTER E WITH ACUTE, as well as for things such as the double-f ligature.

For example, let's consider the German word "öffnen" (to open). Without even considering case, there are four ways to encode it:

1. o + COMBINING DIAERESIS (U+0308) + f + f + n + e + n
2. LATIN SMALL LETTER O WITH DIAERESIS (U+00F6) + f + f + n + e + n
3. o + COMBINING DIAERESIS + DOUBLE-F LIGATURE (U+FB00) + n + e + n
4. LATIN SMALL LETTER O WITH DIAERESIS + DOUBLE-F LIGATURE + n + e + n
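The four encodings can be constructed directly with pack, which makes it easy to confirm that they really are four different strings as far as Ruby is concerned. The constant and variable names below are only illustrative.

```ruby
O_DIAERESIS = 0x00F6   # LATIN SMALL LETTER O WITH DIAERESIS
DIAERESIS   = 0x0308   # COMBINING DIAERESIS
FF_LIGATURE = 0xFB00   # LATIN SMALL LIGATURE FF

o, f, n, e = "ofne".unpack('U*')

forms = [
  [o, DIAERESIS, f, f, n, e, n],
  [O_DIAERESIS, f, f, n, e, n],
  [o, DIAERESIS, FF_LIGATURE, n, e, n],
  [O_DIAERESIS, FF_LIGATURE, n, e, n]
].map {|codepoints| codepoints.pack('U*') }

forms.each do |str|
  puts "#{str.unpack('U*').size} codepoints, #{str.unpack('C*').size} bytes"
end
# 7 codepoints, 8 bytes
# 6 codepoints, 7 bytes
# 6 codepoints, 9 bytes
# 5 codepoints, 8 bytes

puts forms.uniq.size    # 4
```

All four would normally be displayed as the same word, yet no two of them compare as equal.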

The diaeresis (also spelled dieresis) is simply a pair of dots over a character. In German it is called an umlaut.

Normalizing is the process of standardizing the character representations used. After normalizing, we can be sure that a given character is encoded in a particular way. Exactly what those forms are depends on what we are trying to achieve. Annex 15 of the Unicode Standard lists four normalization forms:

1. D  (Canonical Decomposition)
2. C  (Canonical Decomposition followed by Canonical Composition)
3. KD (Compatibility Decomposition)
4. KC (Compatibility Decomposition followed by Canonical Composition)

You'll also see these written as NFKC (Normalization Form KC) and so on.

The precise rules set out in the standard are complex and cover the difference between "canonical equivalence" and "compatibility equivalence." (Korean and Japanese require particular attention, but we won't address these here.) Table 4.2 summarizes the effects of each normalization form on the strings we started with previously.

Table 4.2. Normalized Unicode Forms

Forms C and D are reversible, whereas KC and KD are not. On the other hand, the data lost in KC and KD means that all four strings become binary-identical after either of those normalizations. Which form is most appropriate depends on the application at hand. We'll talk about this a bit more in the next section.

Although Ruby doesn't include it as standard, a library is available that performs these normalizations. Refer to http://www.yoshidam.net/Ruby.html (installed via gem install unicode).

With the unicode library installed, it's easy to perform normalization for each of the previous forms with the Unicode.normalize_X family of methods:

require 'unicode'
sword_kd = Unicode.normalize_KD(sword)
sword_kd.scan(/./)                   # ["e", "´", "p", "e", "´", "e"]
sword_kc = Unicode.normalize_KC(sword)
sword_kc.scan(/./)                   # ["é", "p", "é", "e"]

4.2.5. Issues in String Collation

In computing, collation refers to the process of arranging text according to a particular order. Generally, but not always, this implies some kind of alphabetical or similar order. Collation is closely connected to normalization and uses some of the same concepts and code.

For example, let's consider an array of strings that we want to collate:

eacute = [0x00E9].pack('U')
acute = [0x0301].pack('U')
array = ["epicurian", "#{eacute}p#{eacute}e", "e#{acute}lan"]
# ["epicurian", "épée", "élan"]

What happens when we use Ruby's Array#sort method?

array.sort   # ["epicurian", "élan", "épée"]

That's not what we want. But let's try to understand why it happens. Ruby's string sort is done by a simple byte-by-byte comparison. We can see this by looking at the first few bytes of each string:

array.map {|item| "#{item}: #{item.unpack('C*')[0,3].join(',')}" }
# ["epicurian: 101,112,105", "épée: 195,169,112",
# "e´lan: 101,204,129"]

There are two complications. First, the fact that non-ASCII UTF-8 characters start with a large byte value means that they will inevitably be sorted after their ASCII counterparts. Second, decomposed Latin characters are sorted before precomposed characters because of their leading ASCII byte.

Operating systems typically include collation functions that allow two strings to be compared according to the locale's encoding and language specifications. In C, this is handled by the strxfrm and strcoll functions in the standard library.

Bear in mind that this is something of an issue even with ASCII. When we sort ASCII strings in Ruby, we're doing a straight lexicographic sort; in a complex real-life situation (for example, sorting the titles in the Library of Congress) there are many special rules that aren't followed by such a simplistic sorting technique.

To collate strings, we can generate an intermediate value that is used to sort them. Exactly how we construct this value depends on our own requirements and those of the language that we are processing; there is no single universal collation algorithm.

Let's assume that we are processing our list according to English rules and that we are going to ignore accents. The first step is to define our transformation method. We'll normalize our strings to decomposed forms and then elide the diacritics, leaving just the base characters. The Unicode range for combining diacritical marks runs from U+0300 to U+036F:

def transform(str)
  Unicode.normalize_KD(str).unpack('U*').select{ |cp|
    cp < 0x0300 || cp > 0x036F
  }.pack('U*')
end
array.map{|x| transform(x) }     # ["epicurian", "epee", "elan"]

Next, we create a hash table to map strings to their transformed versions and use that to sort the original strings. The hash table means that we only need to calculate the transformed form once per original string.

def collate(array)
  transformations = array.inject({}) do |hash, item|
    hash[item] = yield item
    hash
  end
  array.sort_by {|x| transformations[x] }
end
collate(array) {|a| transform(a) }    # ["élan", "épée", "epicurian"]
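Because collate takes the transformation as a block, it can be tried out without any Unicode library at all. In this sketch the block is plain downcase, which is enough to show that the original strings are returned in the order of their transformed versions. The method is repeated so that the snippet runs on its own.

```ruby
def collate(array)
  # Map each original string to its sort key, computed once per string.
  transformations = array.inject({}) do |hash, item|
    hash[item] = yield item
    hash
  end
  array.sort_by {|x| transformations[x] }
end

words = %w[banana Cherry apple]
p words.sort                          # ["Cherry", "apple", "banana"]
p collate(words) {|w| w.downcase }    # ["apple", "banana", "Cherry"]
```

Note that sort_by already evaluates its block only once per element, so the hash mainly saves work when the array contains the same string more than once.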

That's better, but we haven't addressed capitalization or character equivalents yet. Let's look at German as an example.

In fact, there is more than one collation for German; we'll use the DIN-2 collation (or phone book collation) for this exercise, in which the German character "ß" is equivalent to "ss", and the umlaut is equivalent to a letter "e", so "ö" is equivalent to "oe" and so on.

Our transformation method should address this. Once again, we will start by normalizing our string to a decomposed form. For reference, the combining diaeresis (or umlaut) is U+0308. We'll also use Ruby's case conversion, but we need to augment it a little. Here, then, is a basic transformation method:

def transform_de(str)
  decomposed = Unicode.normalize_KD(str).downcase
  decomposed.gsub!('ß', 'ss')
  decomposed.gsub([0x0308].pack('U'), 'e')
end
array = ["Straße", "öffnen"]
array.map {|x| transform_de(x) }    # ["strasse", "oeffnen"]

Not all languages are so straightforward. Spanish, for example, adds an additional letter, "ñ", between "n" and "o". However, as long as we shift the remaining letters along somehow, we can cope with this. Notice how Listing 4.1 uses the precomposed normalized form to simplify our processing. We are also going to make things easier by ignoring the distinction between accented and nonaccented letters.

Listing 4.1. Collation in Spanish

def map_table(list)
  table = {}
  list.each_with_index do |item, i|
    item.split(',').each do |subitem|
      table[Unicode.normalize_KC(subitem)] = (?a + i).chr
    end
  end
  table
end
ES_SORT = map_table(%w(
  a,A,á,Á b,B c,C d,D e,E,é,É f,F g,G h,H i,I,í,Í j,J k,K l,L m,M
  n,N ñ,Ñ o,O,ó,Ó p,P q,Q r,R s,S t,T u,U,ú,Ú v,V w,W x,X y,Y z,Z
))
def transform_es(str)
  array = Unicode.normalize_KC(str).scan(/./u)
  array.map {|c| ES_SORT[c] || c}.join
end
array = %w[éste estoy año apogeo amor]
array.map {|a| transform_es(a) }
# ["etue", "etupz", "aop", "aqpgep", "amps"]
collate(array) {|a| transform_es(a) }
# ["amor", "año", "apogeo", "éste", "estoy"]

Real-world collation is slightly more complex than the preceding examples and typically employs up to three levels. Usually, the first level tests the character identity only, ignoring accents and case, the second level differentiates accents, while the third level also takes case into consideration. The second and third levels are used only if two strings have equal collation at preceding levels. Furthermore, some languages sort multiple-character sequences as a single semantic unit (for example, "lj" in Croatian is placed between "l" and "m"). The development of a language-specific or generalized collation algorithm is therefore not a trivial task; it demands knowledge of the language in question. It's also not possible to devise a truly generic collation algorithm that works for all languages, although there are algorithms that attempt to achieve something close.
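The three-level idea can be sketched with an array used as the sort key, because Ruby compares arrays element by element and consults a later element only when the earlier ones are equal. This is a deliberately naive illustration, not a real collation algorithm: collation_key is a hypothetical name, the input is assumed to be already decomposed, and the third level is nothing more than a bytewise comparison of the base letters.

```ruby
COMBINING_MARKS = 0x0300..0x036F

# Hypothetical three-level key; expects an already decomposed string.
def collation_key(str)
  marks, base = str.unpack('U*').partition {|cp| COMBINING_MARKS.include?(cp) }
  letters = base.pack('U*')
  [letters.downcase, marks, letters]   # level 1, level 2, level 3
end

# "résumé" in decomposed form: each "e" is followed by U+0301.
accented = [0x72, 0x65, 0x301, 0x73, 0x75, 0x6D, 0x65, 0x301].pack('U*')
words = ["rose", accented, "resume", "Resume"]

sorted = words.sort_by {|w| collation_key(w) }
puts sorted
# Resume
# resume
# résumé  (the decomposed string)
# rose
```

The three spellings of "resume" tie at the first level; the accents then separate one of them at the second level, and case decides between the remaining two at the third.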

4.2.6. Converting Between Encodings

Ruby's standard library includes an interface to the iconv library for converting between character encodings. This should work on all platforms, including Windows with the one-click installer.

If we want to convert a UTF-8 string to ISO-8859-15, we can use iconv as follows:

require 'iconv'
converter = Iconv.new('ISO-8859-15', 'UTF-8')
sword_iso = converter.iconv(sword)

It's important to remember that the encodings are listed in the order destination, source (rather like an assignment). The number and names of the encodings available depend on the platform, but the more common and popular encodings tend to be well standardized and available everywhere. If the iconv command-line program is available, we can get a list of the recognized encodings by issuing the iconv -l command.
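To make concrete what such a conversion does to the bytes, here is a hand-rolled sketch that avoids iconv entirely. It handles only ISO-8859-1 (not ISO-8859-15, which reassigns a few positions such as the euro sign), because in ISO-8859-1 every byte value is equal to its Unicode codepoint. The method names are hypothetical, and this is no substitute for a real conversion library.

```ruby
# Hypothetical helpers; valid only for ISO-8859-1.
def utf8_to_latin1(str)
  codepoints = str.unpack('U*')
  if codepoints.any? {|cp| cp > 0xFF }
    raise ArgumentError, "not representable in ISO-8859-1"
  end
  codepoints.pack('C*')    # one byte per codepoint
end

def latin1_to_utf8(str)
  str.unpack('C*').pack('U*')
end

sword     = [0xE9, 0x70, 0xE9, 0x65].pack('U*')   # "épée" in UTF-8
sword_iso = utf8_to_latin1(sword)

p sword.unpack('C*')                    # [195, 169, 112, 195, 169, 101]
p sword_iso.unpack('C*')                # [233, 112, 233, 101]
p latin1_to_utf8(sword_iso) == sword    # true
```

The same four characters take six bytes in UTF-8 and four in ISO-8859-1, and the round trip gives back the original string.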

In addition to the name of the encoding, iconv accepts a couple of switches to control its behavior. These are appended to the destination encoding string.

Usually, iconv raises an error if it encounters invalid input or if it otherwise cannot represent the input in the output encoding. The //IGNORE switch tells it to skip these errors silently:

broken_utf8_string = "hello\xfe"
converter = Iconv.new('ISO-8859-15', 'UTF-8')
converter.iconv(broken_utf8_string)    # raises Iconv::IllegalSequence
converter = Iconv.new('ISO-8859-15//IGNORE', 'UTF-8')
converter.iconv(broken_utf8_string)    # "hello"

The same switch also lets us clean up a string:

broken_sword = "épée\xfe"
converter = Iconv.new('UTF-8//IGNORE', 'UTF-8')
converter.iconv(broken_sword) # "épée"

Sometimes characters can't be represented in the target encoding. Usually these will raise an exception. The //TRANSLIT switch tells iconv to approximate characters (where possible) instead of raising an error:

converter = Iconv.new('ASCII', 'UTF-8')
converter.iconv(sword)       # raises Iconv::IllegalSequence
converter = Iconv.new('ASCII//IGNORE', 'UTF-8')
converter.iconv(sword)       # "pe"
converter = Iconv.new('ASCII//TRANSLIT', 'UTF-8')
converter.iconv(sword)       # "'ep'ee"

We could use this to make an ASCII-clean URL, for example:

str = "Straße épée"
converter = Iconv.new('ASCII//TRANSLIT', 'UTF-8')
# Transliterate, turn spaces into hyphens, then drop anything
# that is not a letter or a hyphen
converter.iconv(str).gsub(/ /, '-').gsub(/[^a-z\-]/in, '').downcase
# "strasse-epee"

However, this will only work for the Latin alphabet.

Listing 4.2 illustrates a real-life example of iconv being used with open-uri to retrieve a web page and transcode its content into UTF-8.

Listing 4.2. Transcoding a Web Page into UTF-8

require 'open-uri'
require 'iconv'
def get_web_page_as_utf8(url)
  open(url) do |io|
    source = io.read
    type, *parameters = io.content_type_parse
    # Don't transcode anything that isn't (X)HTML
    unless type =~ %r!^(?:text/html|application/xhtml\+xml)$!
      return source
    end
    # Check server headers first:
    if pair = parameters.assoc('charset')
      encoding = pair.last
    # Next, look in the HTML:
    elsif source =~ /<meta[^>]*?charset=([^\s'"]+)/i
      encoding = $1
    # Finally, use the HTTP default
    else
      encoding = 'ISO-8859-1'
    end
    converter = Iconv.new('UTF-8//IGNORE', encoding)
    return converter.iconv(source)
  end
end
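The part of the listing that decides which encoding to use can be pulled out and exercised without touching the network. In this sketch pick_encoding is a hypothetical name, and the pattern that looks for a charset inside a meta tag is my assumption about what the listing intends.

```ruby
# Hypothetical helper isolating the encoding decision in Listing 4.2.
# parameters is an array of [name, value] pairs from the Content-Type
# header; source is the raw page text.
def pick_encoding(parameters, source)
  if pair = parameters.assoc('charset')
    pair.last                       # 1. the server header wins
  elsif source =~ /<meta[^>]*?charset=([^\s'"]+)/i
    $1                              # 2. then a charset in the HTML
  else
    'ISO-8859-1'                    # 3. finally, the HTTP default
  end
end

html = '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">'

p pick_encoding([['charset', 'utf-8']], html)   # "utf-8"
p pick_encoding([], html)                       # "Shift_JIS"
p pick_encoding([], '<html></html>')            # "ISO-8859-1"
```

Keeping the decision separate from the download makes each of the three fallbacks easy to test in isolation.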

There are other OS issues relating to character conversion. Suppose that the operating system on which Ruby is running is set to a non-UTF-8 locale, or Ruby doesn't use UTF-8 to communicate with the OS (as is the case with the Win32 package). Then there are additional complications.

For example, Windows supports Unicode filenames and uses Unicode internally. But Ruby at the present time communicates with Windows through the legacy code page. In the case of English and most other western European editions, this is code page 1252 (or WINDOWS-1252).

You can still use UTF-8 inside your application, but you'll need to convert to the legacy code page to specify filenames. This can be done using iconv, but it's important to remember that the legacy code page can describe only a small subset of the characters available in Unicode.

In addition, this means that Ruby on Windows cannot, at present, open existing files whose names cannot be described in the legacy code page. This restriction does not apply to Mac OS X, Linux, or other systems using UTF-8 locales.