Classifying Text with a Bayesian Analyzer

Problem

You want to classify chunks of text by example: an email message is either spam or not spam, a joke is either funny or not funny, and so on.

Solution

Use Lucas Carlson's Classifier library, available as the classifier gem. It provides a naive Bayesian classifier, and one that implements Latent Semantic Indexing, a more advanced technique.

The interface for the naive Bayesian classifier is very straightforward. You create a Classifier::Bayes object with some classifications, and train it on text chunks whose classification is known:

	require 'rubygems'
	require 'classifier'

	classifier = Classifier::Bayes.new('Spam', 'Not spam')

	classifier.train_spam 'are you in the market for viagra? we sell viagra'
	classifier.train_not_spam 'hi there, are we still on for lunch?'

You can then feed the classifier text chunks whose classification is unknown, and have it guess:

	classifier.classify "we sell the cheapest viagra on the market"
	# => "Spam"
	classifier.classify "lunch sounds great"
	# => "Not spam"

 

Discussion

Bayesian analysis is based on probabilities. When you train the classifier, you give it a set of words, and it keeps track of how often each word shows up in each category. In the simple spam filter built in the Solution, the frequency hash looks like the @categories variable below:

	classifier
	# => #<Classifier::Bayes:0x...
	#      @categories={
	#        :"Not spam"=>
	#          { :lunch=>1, :for=>1, :there=>1,
	#            :"?"=>1, :still=>1, :","=>1 },
	#        :Spam=>
	#          { :market=>1, :for=>1, :viagra=>2, :"?"=>1, :sell=>1 }
	#      },
	#      @total_words=12>

These hashes are used to build probability calculations. Note that since we mentioned the word "viagra" twice in spam messages, there is a 2 in the "Spam" frequency hash for that word. That makes it more spam-like than other words like "for" (which also shows up in nonspam) or "sell" (which only shows up once in spam). The classifier can apply these probabilities to previously unseen text and guess at a classification for it.
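
To make the step from counts to a guess concrete, here is a minimal sketch in plain Ruby. It is not the Classifier library's own code: it copies the word counts from the frequency hash shown above, and it assumes add-one (Laplace) smoothing for unseen words, a crude regular-expression tokenizer, and equally likely categories. The names COUNTS, score, and guess are hypothetical.

```ruby
# Word counts copied from the frequency hash shown above.
COUNTS = {
  'Not spam' => { 'lunch' => 1, 'for' => 1, 'there' => 1,
                  '?' => 1, 'still' => 1, ',' => 1 },
  'Spam'     => { 'market' => 1, 'for' => 1, 'viagra' => 2,
                  '?' => 1, 'sell' => 1 }
}

# Hypothetical helper: the log-probability of the words under one category.
# Assumption: add-one smoothing, so an unseen word never yields a zero.
def score(words, counts)
  total = counts.values.sum
  vocabulary = COUNTS.values.flat_map(&:keys).uniq.size
  words.sum { |w| Math.log((counts.fetch(w, 0) + 1.0) / (total + vocabulary)) }
end

# Hypothetical helper: pick the category with the highest score.
def guess(text)
  words = text.downcase.scan(/\w+|[?,]/)   # crude tokenizer (assumption)
  COUNTS.max_by { |_category, counts| score(words, counts) }.first
end

guess('we sell the cheapest viagra on the market')  # => "Spam"
guess('lunch sounds great')                         # => "Not spam"
```

Summing logarithms instead of multiplying raw probabilities keeps the score from underflowing on long texts. Because "viagra" was seen twice in spam, it pulls the score toward "Spam" more strongly than "sell" or "market" does.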

The more text you use to train the classifier, the better it becomes at guessing. If you can verify the classifier's guesses (for instance, by asking the user whether a message really was spam), you should use that information to train the classifier with new data as it comes in.
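
One way to wire up that feedback loop is sketched below. The helper name classify_with_feedback is hypothetical, and so is the idea of passing a block that asks the user. The only calls it assumes on the classifier are classify and the generic train method, both of which appear in this recipe.

```ruby
# Hypothetical helper: guess a category, let the caller confirm or correct
# it, then train the classifier on the confirmed category.
# `classifier` is any object that responds to classify(text) and
# train(category, text), as Classifier::Bayes does in this recipe.
def classify_with_feedback(classifier, text)
  guess = classifier.classify(text)
  confirmed = yield(guess)            # block returns the correct category
  classifier.train(confirmed, text)   # learn from the verified answer
  confirmed
end

# Usage sketch (ask_user is a hypothetical method that prompts the user):
#   classify_with_feedback(classifier, message) { |guess| ask_user(guess) }
```

Training on the confirmed category reinforces correct guesses as well as correcting wrong ones, so every verified message improves the word counts.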

To save the state of the classifier for later use, you can use Madeleine persistence (Recipe 13.3), which writes the state of your classifier to your hard drive.

A few more notes about this type of classifier: a Bayesian classifier supports as many categories as you want. "Spam" and "Not spam" are the most common, but you are not limited to two. You can also use the generic train method instead of calling train_[category_name]. Here's a classifier that has three categories and uses the generic train method:

	classifier = Classifier::Bayes.new('Interesting', 'Funny', 'Dramatic')

	classifier.train 'Interesting', "Leaving reminds us of what we can part
	 with and what we can't, then offers us something new to look forward
	 to, to dream about."
	classifier.train 'Funny', "Knock knock. Who's there? Boo boo. Boo boo
	 who? Don't cry, it is only a joke."
	classifier.train 'Dramatic', 'I love you! I hate you! Get out right
	 now.'

	classifier.classify 'what!'
	# => "Dramatic"
	classifier.classify "who's on first?"
	# => "Funny"
	classifier.classify 'perchance to dream'
	# => "Interesting"

It's also possible to "untrain" a category on a piece of text if you make a mistake or change your mind later:

	classifier.untrain_funny "boo"
	classifier.untrain "Dramatic", "out"
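
Conceptually, untraining reverses the bookkeeping that training does. The sketch below shows the idea on a single category's word tally. WordTally is a hypothetical stand-in written for illustration, not the library's implementation, and its tokenizer is a simplifying assumption.

```ruby
# Hypothetical stand-in for one category's frequency hash.
class WordTally
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  # Training adds one to the count of every word in the text.
  def train(text)
    tokens(text).each { |word| @counts[word] += 1 }
  end

  # Untraining subtracts one, dropping words whose count reaches zero.
  def untrain(text)
    tokens(text).each do |word|
      next unless @counts.key?(word)
      @counts[word] -= 1
      @counts.delete(word) if @counts[word] <= 0
    end
  end

  private

  # Crude tokenizer (assumption): lowercase runs of word characters.
  def tokens(text)
    text.downcase.scan(/\w+/)
  end
end

funny = WordTally.new
funny.train("Boo boo who? Don't cry")
funny.untrain('boo')
funny.counts['boo']  # => 1
```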

 

See Also

  • Recipe 13.3, "Persisting Objects with Madeleine"
  • The README file for the Classifier library has an example of an LSI classifier
  • Bishop (http://bishop.rubyforge.org/) is another Bayesian classifier, a port of Python's Reverend; it's available as the bishop gem
  • http://en.wikipedia.org/wiki/Naive_Bayes_classifier
  • http://en.wikipedia.org/wiki/Latent_Semantic_Analysis


Ruby Cookbook
ISBN: 0596523696