Processing a String One Character at a Time

Problem

You want to process each character of a string individually.

Solution

If you're processing an ASCII document, then each byte corresponds to one character. Use String#each_byte to yield each byte of a string as a number, which you can turn into a one-character string:

	'foobar'.each_byte { |x| puts "#{x} = #{x.chr}" }
	# 102 = f
	# 111 = o
	# 111 = o
	# 98 = b
	# 97 = a
	# 114 = r

Use String#scan to yield each character of a string as a new one-character string:

	'foobar'.scan( /./ ) { |c| puts c }
	# f
	# o
	# o
	# b
	# a
	# r

 

Discussion

Since a string is a sequence of bytes, you might think that the String#each method would iterate over the sequence, the way Array#each does. But String#each is actually used to split a string on a given record separator (by default, the newline):

	"foo\nbar".each { |x| puts x }
	# foo
	# bar
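
The record-splitting behavior described above can be sketched with String#each_line, which takes the same optional separator argument. The records helper is a hypothetical name used only for this illustration. Note that each record keeps its trailing separator:

```ruby
# Hypothetical helper: split a string into records on a separator and
# collect them in an array. The separator defaults to the newline.
def records(string, separator = "\n")
  found = []
  string.each_line(separator) { |record| found << record }
  found
end

p records("foo\nbar")    # ["foo\n", "bar"]
p records("a,b,c", ',')  # ["a,", "b,", "c"]
```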

The string equivalent of Array#each is actually each_byte. A string stores its characters as a sequence of bytes, and each_byte yields each byte in turn as a Fixnum.

String#each_byte is faster than String#scan, so if you're processing an ASCII file, you might want to use String#each_byte and convert each number passed into the code block back into a string (as seen in the Solution).
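
As a minimal sketch of that approach, here is one way to tally the characters of an ASCII string with each_byte. The char_counts helper is a hypothetical name, not something from the recipe, and the code assumes every byte is a complete character:

```ruby
# Hypothetical helper: count how often each character occurs in an ASCII
# string. Each byte arrives as a number and is converted back with chr.
def char_counts(ascii_string)
  counts = Hash.new(0)
  ascii_string.each_byte { |byte| counts[byte.chr] += 1 }
  counts
end

char_counts('foobar').sort.each { |char, n| puts "#{char}: #{n}" }
# a: 1
# b: 1
# f: 1
# o: 2
# r: 1
```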

String#scan works by applying a given regular expression to a string, and yielding each match to the code block you provide. The regular expression /./ matches every character in the string, in turn.
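
One detail worth knowing when you use this technique: the dot in a regular expression does not match a newline unless you add the m modifier. The sketch below uses a hypothetical chars_of helper to show the difference:

```ruby
# Hypothetical helper: collect every match of a pattern into an array.
# The default pattern /./ matches any single character except a newline.
def chars_of(string, pattern = /./)
  found = []
  string.scan(pattern) { |c| found << c }
  found
end

p chars_of("ab\ncd")        # ["a", "b", "c", "d"]
p chars_of("ab\ncd", /./m)  # ["a", "b", "\n", "c", "d"]
```

If the newlines in your input matter, pass /./m instead of /./.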

If you have the $KCODE variable set correctly, then the scan technique will work on UTF-8 strings as well. This is the simplest way to sneak a notion of "character" into Ruby's byte-based strings.

Here's a Ruby string containing the UTF-8 encoding of the French phrase "ça va":

	french = "\xc3\xa7a va"

Even if your terminal can't properly display the character "ç", you can see how the behavior of String#scan changes when you make the regular expression Unicode-aware (with the u modifier), or set $KCODE so that Ruby handles all strings as UTF-8:

	french.scan(/./) { |c| puts c }
	#
	#
	# a
	#
	# v
	# a

	french.scan(/./u) { |c| puts c }
	# ç
	# a
	#
	# v
	# a

	$KCODE = 'u'
	french.scan(/./) { |c| puts c }
	# ç
	# a
	#
	# v
	# a

Once Ruby knows to treat strings as UTF-8 instead of ASCII, it starts treating the two bytes representing the "ç" as a single character. Even if you can't see UTF-8, you can write programs that handle it correctly.
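
You can check the byte and character counts without relying on your terminal by unpacking the string into numbers. This is only a sketch, and the two helper names are hypothetical. The 'C*' format yields the raw bytes, and 'U*' decodes the bytes as UTF-8 and yields one code point per character:

```ruby
# Hypothetical helpers for inspecting a UTF-8 encoded string.
def utf8_bytes(string)
  string.unpack('C*')   # one number per byte
end

def utf8_code_points(string)
  string.unpack('U*')   # one number per UTF-8 character
end

french = "\xc3\xa7a va"
p utf8_bytes(french)        # [195, 167, 97, 32, 118, 97]
p utf8_code_points(french)  # [231, 97, 32, 118, 97]
```

The six bytes collapse into five code points because the first two bytes together encode "ç" (code point 231).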

See Also

  • Recipe 11.12, "Converting from One Encoding to Another"



Ruby Cookbook
Ruby Cookbook (Cookbooks (OReilly))
ISBN: 0596523696
Pages: 399
