Processing a String One Word at a Time

Problem

You want to split a piece of text into words, and operate on each word.

Solution

First decide what you mean by "word." What separates one word from another? Only whitespace? Whitespace or punctuation? Is "johnny-come-lately" one word or three? Build a regular expression that matches a single word according to whatever definition you need (there are some samples in the Discussion).

Then pass that regular expression into String#scan, which yields every word it finds to a code block. The word_count method defined below takes a piece of text and creates a histogram of word frequencies. Its regular expression considers a "word" to be a string of Ruby identifier characters: letters, numbers, and underscores.

	class String
	  def word_count
	    # Unseen words start with a count of 0.
	    frequencies = Hash.new(0)
	    downcase.scan(/\w+/) { |word| frequencies[word] += 1 }
	    return frequencies
	  end
	end

	%{Dogs dogs dog dog dogs.}.word_count
	# => {"dogs"=>3, "dog"=>2}
	%{"I have no shame," I said.}.word_count
	# => {"no"=>1, "shame"=>1, "have"=>1, "said"=>1, "i"=>2}
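To see how much the definition of "word" matters, here is a small sketch using the hyphenated example from above. It calls String#scan without a code block, in which case scan returns the matches as an array. The variable names are only for illustration.

```ruby
# With /\w+/, a hyphen is not a word character, so it separates words.
by_identifier = "johnny-come-lately".scan(/\w+/)
# => ["johnny", "come", "lately"]

# Treating every run of non-whitespace as a word keeps the phrase whole.
by_whitespace = "johnny-come-lately".scan(/\S+/)
# => ["johnny-come-lately"]
```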

Discussion

The regular expression /\w+/ is nice and simple, but you can probably do better for your application's definition of "word." You probably don't consider two words separated by an underscore to be a single word. Some English words, like "pan-fried" and "fo'c'sle", contain embedded punctuation. Here are a few more definitions of "word" in regular expression form:

	# Just like /\w+/, but doesn't consider underscore part of a word.
	/[0-9A-Za-z]+/

	# Anything that's not whitespace is a word.
	/\S+/

	# Accept dashes and apostrophes as parts of words.
	/[-'\w]+/

	# A pretty good heuristic for matching English words.
	/(\w+([-'.]\w+)*)/

The last one deserves some explanation. It matches embedded punctuation within a word, but not at the edges. "Work-in-progress" is recognized as a single word, and "-never-" is recognized as the word "never" surrounded by punctuation. This regular expression can even pick out abbreviations and acronyms such as "Ph.D" and "U.N.C.L.E.", though it can't distinguish between the final period of an acronym and the period that ends a sentence. This means that "E.F.F." will be recognized as the word "E.F.F" and then a nonword period.
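The claims above can be checked with a short sketch. The helper name english_words is hypothetical, and the regular expression is the heuristic with its parentheses balanced. Called without a block, scan returns one array per match when the expression contains groups, so the helper keeps only the first group of each.

```ruby
# Hypothetical helper: returns the words found by the English-word heuristic.
def english_words(text)
  # Each match is [whole_word, last_punctuated_part]; keep the whole word.
  text.scan(/(\w+([-'.]\w+)*)/).map { |word, _| word }
end

english_words("Work-in-progress")  # => ["Work-in-progress"]
english_words("-never-")           # => ["never"]
english_words("Ph.D")              # => ["Ph.D"]
english_words("U.N.C.L.E.")        # => ["U.N.C.L.E"]
english_words("E.F.F.")            # => ["E.F.F"]
```

In the last two cases the trailing period is left out of the match because no word character follows it.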

Let's rewrite our word_count method to use that regular expression. We can't use the original implementation, because its code block takes only one argument. String#scan passes its code block one argument for each match group in the regular expression, and our improved regular expression has two match groups. The first match group is the one that actually contains the word. So we must rewrite word_count so that its code block takes two arguments, and ignores the second one:

	class String
	  def word_count
	    frequencies = Hash.new(0)
	    # The first group holds the whole word; the second is ignored.
	    downcase.scan(/(\w+([-'.]\w+)*)/) { |word, ignore| frequencies[word] += 1 }
	    return frequencies
	  end
	end

	%{"That F.B.I. fella--he's quite the man-about-town."}.word_count
	# => {"quite"=>1, "f.b.i"=>1, "the"=>1, "fella"=>1, "that"=>1,
	# "man-about-town"=>1, "he's"=>1}
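An alternative, not used in the recipe's own code, is to make the groups non-capturing with (?:...). With no capturing groups, scan yields each whole match as a plain string, so the block needs only one argument. This is a minimal sketch; the method name word_count_noncapturing is hypothetical, and it takes the text as an argument rather than reopening String.

```ruby
# Hypothetical variant of word_count that avoids capturing groups.
def word_count_noncapturing(text)
  frequencies = Hash.new(0)
  # (?:...) groups without capturing, so the block receives the whole match.
  text.downcase.scan(/\w+(?:[-'.]\w+)*/) { |word| frequencies[word] += 1 }
  frequencies
end

counts = word_count_noncapturing(%{"That F.B.I. fella--he's quite the man-about-town."})
counts["f.b.i"]           # => 1
counts["man-about-town"]  # => 1
counts["he's"]            # => 1
```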

Note that the "\w" character set matches different things depending on the value of $KCODE. By default, "\w" matches only characters that are part of ASCII words:

	french = "il \xc3\xa9tait une fois"
	french.word_count
	# => {"fois"=>1, "une"=>1, "tait"=>1, "il"=>1}

If you turn on Ruby's UTF-8 support, the "\w" character set matches more characters:

	$KCODE='u'
	french.word_count
	# => {"fois"=>1, "une"=>1, "était"=>1, "il"=>1}
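The $KCODE switch belongs to Ruby 1.8. On Ruby 1.9 and later it no longer has any effect, and "\w" matches only ASCII word characters. The following is a minimal sketch for those versions, assuming a UTF-8 string and using the POSIX bracket expression [[:word:]], which is Unicode-aware there. The method name unicode_word_count is hypothetical.

```ruby
# Hypothetical helper for Ruby 1.9 and later.
def unicode_word_count(text)
  frequencies = Hash.new(0)
  # [[:word:]] matches Unicode letters, marks, numbers, and connectors.
  text.downcase.scan(/[[:word:]]+/) { |word| frequencies[word] += 1 }
  frequencies
end

french = "il \u00e9tait une fois"  # "\u00e9" is e with an acute accent

ascii_words = french.scan(/\w+/)
# => ["il", "tait", "une", "fois"]

unicode_words = unicode_word_count(french).keys
# => ["il", "\u00e9tait", "une", "fois"]
```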

The regular expression anchor \b matches a word boundary: the zero-width position between a word character and a nonword character, such as the end of a word just before a piece of whitespace or punctuation. This is useful for String#split (see Recipe 1.4), but not so useful for String#scan.
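As a rough illustration of why a word boundary suits String#split, the sketch below splits a made-up sample string on \b. Because the boundary itself is empty, nothing is consumed, and the separators come back as pieces of their own.

```ruby
# Splitting on the empty word boundary keeps the separators as pieces.
pieces = "one two, three".split(/\b/)
# => ["one", " ", "two", ", ", "three"]

# Keeping only the pieces that contain a word character leaves the words.
words = pieces.grep(/\w/)
# => ["one", "two", "three"]
```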