Problem
You want to classify chunks of text by example: an email message is either spam or not spam, a joke is either funny or not funny, and so on.
Solution
Use Lucas Carlson's Classifier library, available as the classifier gem. It provides a naive Bayesian classifier, and one that implements Latent Semantic Indexing, a more advanced technique.
The interface for the naive Bayesian classifier is very straightforward. You create a Classifier::Bayes object with some classifications, and train it on text chunks whose classification is known:
require 'rubygems'
require 'classifier'

classifier = Classifier::Bayes.new('Spam', 'Not spam')
classifier.train_spam 'are you in the market for viagra? we sell viagra'
classifier.train_not_spam 'hi there, are we still on for lunch?'
You can then feed the classifier text chunks whose classification is unknown, and have it guess:
classifier.classify "we sell the cheapest viagra on the market"
# => "Spam"
classifier.classify "lunch sounds great"
# => "Not spam"
Discussion
Bayesian analysis is based on probabilities. When you train the classifier, you give it chunks of text with known categories, and it keeps track of how often each word shows up in each category. For the simple spam filter built in the Solution, the frequency data looks like the @categories variable below:
classifier
# => #<Classifier::Bayes:0x... @categories=
#     { :"Not spam"=>
#        { :lunch=>1, :for=>1, :there=>1,
#          :"?"=>1, :still=>1, :","=>1 },
#       :Spam=>
#        { :market=>1, :for=>1, :viagra=>2, :"?"=>1, :sell=>1 }
#     },
#    @total_words=12>
These hashes are used to build probability calculations. Note that since we mentioned the word "viagra" twice in spam messages, there is a 2 in the "Spam" frequency hash for that word. That makes it more spam-like than other words like "for" (which also shows up in nonspam) or "sell" (which only shows up once in spam). The classifier can apply these probabilities to previously unseen text and guess at a classification for it.
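The scoring idea can be sketched in a few lines of plain Ruby. This is only an illustration of the technique, not the classifier gem's actual implementation; the classify helper, the tokenization, and the add-one smoothing are assumptions made for the sketch:

```ruby
# Word-frequency hashes like the ones the classifier builds above.
categories = {
  "Not spam" => { "lunch" => 1, "for" => 1, "there" => 1, "still" => 1 },
  "Spam"     => { "market" => 1, "for" => 1, "viagra" => 2, "sell" => 1 }
}

# Score each category by summing log-probabilities of the words in the
# text, then pick the category with the highest score.
def classify(categories, text)
  words = text.downcase.scan(/\w+/)
  scores = categories.map do |name, freqs|
    total = freqs.values.sum.to_f
    # Add-one smoothing: a word never seen in a category should lower
    # the score, not force it to negative infinity.
    score = words.sum { |w| Math.log((freqs.fetch(w, 0) + 1) / (total + 1)) }
    [name, score]
  end
  scores.max_by { |_, s| s }.first
end

classify(categories, "we sell viagra")      # => "Spam"
classify(categories, "still on for lunch")  # => "Not spam"
```

Because "viagra" has a count of 2 in the "Spam" hash, any text containing it gets a noticeably higher spam score than a text containing only shared words like "for".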
The more text you use to train the classifier, the better it becomes at guessing. If you can verify the classifier's guesses (for instance, by asking the user whether a message really was spam), you should use that information to train the classifier with new data as it comes in.
To save the state of the classifier for later use, you can use Madeleine persistence (Recipe 13.3), which writes the state of your classifier to your hard drive.
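If Madeleine is more machinery than you need, Ruby's built-in Marshal module can also snapshot state to disk, assuming the object you save is marshalable. In this sketch a plain hash stands in for the trained classifier:

```ruby
require 'tmpdir'

# Hypothetical state: in real use this would be your trained classifier.
state = { :Spam => { :viagra => 2 }, :"Not spam" => { :lunch => 1 } }

# Write the state out in binary mode, then read it back.
path = File.join(Dir.mktmpdir, 'classifier.dump')
File.open(path, 'wb') { |f| Marshal.dump(state, f) }
restored = File.open(path, 'rb') { |f| Marshal.load(f) }

restored == state  # => true
```

Only load marshaled data you wrote yourself; Marshal.load is not safe on untrusted input.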
A few more notes about this type of classifier. A Bayesian classifier supports as many categories as you want. "Spam" and "Not spam" are the most common, but you are not limited to two. You can also use the generic train method instead of calling train_[category_name]. Here's a classifier that has three categories and uses the generic train method:
classifier = Classifier::Bayes.new('Interesting', 'Funny', 'Dramatic')
classifier.train 'Interesting', "Leaving reminds us of what we can part with and what we can't, then offers us something new to look forward to, to dream about."
classifier.train 'Funny', "Knock knock. Who's there? Boo boo. Boo boo who? Don't cry, it is only a joke."
classifier.train 'Dramatic', 'I love you! I hate you! Get out right now.'

classifier.classify 'what!'              # => "Dramatic"
classifier.classify "who's on first?"    # => "Funny"
classifier.classify 'perchance to dream' # => "Interesting"
It's also possible to "untrain" a category if you make a mistake or change your mind later:
classifier.untrain_funny "boo"
classifier.untrain "Dramatic", "out"
See Also
Recipe 13.3, on persisting objects with Madeleine