Processing a String One Character at a Time

Problem

You want to process each character of a string individually.

Solution

If you're processing an ASCII document, then each byte corresponds to one character. Use String#each_byte to yield each byte of a string as a number, which you can turn into a one-character string:

	'foobar'.each_byte { |x| puts "#{x} = #{x.chr}" }
	# 102 = f
	# 111 = o
	# 111 = o
	# 98 = b
	# 97 = a
	# 114 = r

Use String#scan to yield each character of a string as a new one-character string:

	'foobar'.scan( /./ ) { |c| puts c }
	# f
	# o
	# o
	# b
	# a
	# r

Discussion

Since a string is a sequence of bytes, you might think that the String#each method would iterate over the sequence, the way Array#each does. But String#each is actually used to split a string on a given record separator (by default, the newline):

	"foo\nbar".each { |x| puts x }
	# foo
	# bar
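The record-separator behavior is easier to see with an explicit separator. The following is a minimal sketch, not a definitive implementation: it uses String#each_line, which accepts the same optional separator argument and also exists in Ruby versions that no longer provide String#each. The helper name split_records is hypothetical and exists only for illustration.

```ruby
# Hypothetical helper: collect the chunks that each_line yields
# when the string is split on the given record separator.
def split_records(string, separator = "\n")
  records = []
  string.each_line(separator) { |record| records << record }
  records
end

p split_records("foo\nbar")     # ["foo\n", "bar"]
p split_records("a,b,c", ",")   # ["a,", "b,", "c"]
```

Note that each yielded record keeps its trailing separator, which is why puts in the example above prints no extra blank line.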

The string equivalent of Array#each is actually each_byte. A string stores its characters as a sequence of bytes, and each_byte yields each one to the code block as a Fixnum.
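To make the byte-as-number view concrete, here is a small sketch that gathers the numbers each_byte yields and then rebuilds the original string from them. The helper name byte_values is hypothetical.

```ruby
# Hypothetical helper: gather the numbers that each_byte yields.
def byte_values(string)
  values = []
  string.each_byte { |b| values << b }
  values
end

values = byte_values('foobar')
p values                            # [102, 111, 111, 98, 97, 114]

# Integer#chr turns each number back into a one-character string.
p values.map { |b| b.chr }.join     # "foobar"
```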

String#each_byte is faster than String#scan, so if you're processing an ASCII file, you might want to use String#each_byte and convert each number passed into the code block back into a string (as seen in the Solution).

String#scan works by applying a given regular expression to a string, and yielding each match to the code block you provide. The regular expression /./ matches each character in the string in turn, except for newlines; use /./m if you need to match those as well.
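The pattern you pass determines what scan yields. As a rough sketch (the helper name matches is hypothetical), the following collects the matches for a few patterns, including the newline case:

```ruby
# Hypothetical helper: collect every match that String#scan yields.
def matches(string, pattern)
  found = []
  string.scan(pattern) { |m| found << m }
  found
end

p matches("ab\ncd", /./)     # ["a", "b", "c", "d"]
p matches("ab\ncd", /./m)    # ["a", "b", "\n", "c", "d"]
p matches("foobar", /o/)     # ["o", "o"]
```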

If you have the $KCODE variable set correctly, then the scan technique will work on UTF-8 strings as well. This is the simplest way to sneak a notion of "character" into Ruby's byte-based strings.

Here's a Ruby string containing the UTF-8 encoding of the French phrase "ça va":

	french = "\xc3\xa7a va"

Even if your terminal can't properly display the character "ç", you can see how the behavior of String#scan changes when you make the regular expression Unicode-aware, or set $KCODE so that Ruby handles all strings as UTF-8:

	french.scan(/./) { |c| puts c }
	#
	#
	# a
	#
	# v
	# a

	french.scan(/./u) { |c| puts c }
	# ç
	# a
	#
	# v
	# a

	$KCODE = 'u'
	french.scan(/./) { |c| puts c }
	# ç
	# a
	#
	# v
	# a

Once Ruby knows to treat strings as UTF-8 instead of ASCII, it starts treating the two bytes representing the "ç" as a single character. Even if you can't see UTF-8, you can write programs that handle it correctly.
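One way to check the difference without relying on your terminal is to count bytes and characters separately. This is only a sketch: the helper names byte_count and utf8_chars are hypothetical, and it assumes the string literal holds valid UTF-8 bytes, as in the example above.

```ruby
# Hypothetical helper: count the bytes in a string with each_byte.
def byte_count(string)
  count = 0
  string.each_byte { count += 1 }
  count
end

# Hypothetical helper: split a string into characters with a
# Unicode-aware regular expression (the /u modifier).
def utf8_chars(string)
  string.scan(/./u)
end

french = "\xc3\xa7a va"
p byte_count(french)                 # 6
p utf8_chars(french).length          # 5
p byte_count(utf8_chars(french)[0])  # 2
```

The string is six bytes long but only five characters long, because the "ç" occupies two bytes.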