Handling International Encodings

Problem

You need to handle strings that contain non-ASCII characters: probably Unicode characters encoded in UTF-8.

Solution

To use Unicode in Ruby, simply add the following to the beginning of your code:

	$KCODE='u'
	require 'jcode'

You can also invoke the Ruby interpreter with arguments that do the same thing:

	$ ruby -Ku -rjcode

If you use a Unix environment, you can add the arguments to the shebang line of your Ruby application:

	#!/usr/bin/ruby -Ku -rjcode

The jcode library overrides most of the methods of String and makes them capable of handling multibyte text. The exceptions are String#length, String#count, and String#size, which are not overridden. Instead, jcode defines three new methods: String#jlength, String#jcount, and String#jsize.

Discussion

Consider a UTF-8 string that encodes six Unicode characters, the fullwidth Latin capital letters: efbca1 (Ａ), efbca2 (Ｂ), and so on up to efbca6 (Ｆ):

	string = "\xef\xbc\xa1" + "\xef\xbc\xa2" + "\xef\xbc\xa3" +
	 "\xef\xbc\xa4" + "\xef\xbc\xa5" + "\xef\xbc\xa6"

The string contains 18 bytes that encode 6 characters:

	string.size # => 18
	string.jsize # => 6
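The same two counts can be reproduced without jcode, as a rough sketch, by unpacking the string with String#unpack: the "C*" directive yields one integer per byte, while "U*" decodes the bytes as UTF-8 and yields one integer per character. The variable names bytes, codepoints, and labels are illustrative only and do not come from jcode.

```ruby
# The six-character string from above, written out byte by byte.
string = "\xef\xbc\xa1\xef\xbc\xa2\xef\xbc\xa3" +
         "\xef\xbc\xa4\xef\xbc\xa5\xef\xbc\xa6"

# "C*" unpacks every byte as an unsigned 8-bit integer.
bytes = string.unpack("C*")

# "U*" decodes the bytes as UTF-8 and returns one code point per character.
codepoints = string.unpack("U*")

# Format each code point in the usual U+XXXX notation.
labels = codepoints.map { |cp| "U+%04X" % cp }

bytes.size       # => 18
codepoints.size  # => 6
labels.first     # => "U+FF21"
labels.last      # => "U+FF26"
```

Each character occupies three bytes, which is why the byte count is exactly three times the character count.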

String#count takes a string of bytes and counts how many times those bytes occur in the string. String#jcount takes a string of characters and counts how many times those characters occur in the string:

	string.count "\xef\xbc\xa2" # => 13
	string.jcount "\xef\xbc\xa2" # => 1

String#count treats "\xef\xbc\xa2" as three separate bytes, and counts the number of times each of those bytes shows up in the string. String#jcount treats the same string as a single character, and looks for that character in the string, finding it only once. The same distinction applies to String#length and String#jlength:

	"\xef\xbc\xa2".length # => 3
	"\xef\xbc\xa2".jlength # => 1
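To see where the 13 and the 1 in the counting example come from, here is a minimal sketch that performs both tallies by hand. It deliberately avoids String#count and jcode and works on arrays produced by String#unpack, so the result depends only on the bytes themselves. The names target, byte_hits, and char_hits are hypothetical helpers introduced for this illustration.

```ruby
string = "\xef\xbc\xa1\xef\xbc\xa2\xef\xbc\xa3" +
         "\xef\xbc\xa4\xef\xbc\xa5\xef\xbc\xa6"
target = "\xef\xbc\xa2"

# Byte view: treat the target as a set of three separate bytes and
# count every byte of the string that belongs to that set.
target_bytes = target.unpack("C*")   # [0xef, 0xbc, 0xa2]
byte_hits = string.unpack("C*").select { |b| target_bytes.include?(b) }.size

# Character view: decode both strings as UTF-8 and count every
# character of the string that equals the single target character.
target_chars = target.unpack("U*")   # [0xff22]
char_hits = string.unpack("U*").select { |c| target_chars.include?(c) }.size

byte_hits  # => 13
char_hits  # => 1
```

The bytes 0xef and 0xbc each appear six times, once at the start of every character, and 0xa2 appears once, giving 6 + 6 + 1 = 13 byte matches against a single character match.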

Apart from these differences, Ruby handles most Unicode behind the scenes. Once you have your data in UTF-8 format, you really don't have to worry. Given that Ruby's creator Yukihiro Matsumoto is Japanese, it is no wonder that Ruby handles Unicode so elegantly.

See Also

  • If you have text in some other encoding and need to convert it to UTF-8, use the iconv library, as described in Recipe 11.2, "Extracting Data from a Document's Tree Structure"
  • There are several online search engines for Unicode characters; two good ones are at http://isthisthingon.org/unicode/ and http://www.fileformat.info/info/unicode/char/search.htm


Ruby Cookbook
Ruby Cookbook (Cookbooks (OReilly))
ISBN: 0596523696
Pages: 399
