Indexing Structured Text with Ferret

Problem

You want to perform searches on structured text. For instance, you might want to search just the headline of a news story, or just the body.

Solution

The Ferret library can tokenize and search structured data. It's a pure Ruby port of Java's Lucene library, and it's available as the ferret gem.

Here's how to create and populate an index with Ferret. I'll create a searchable index of useful Ruby packages, stored as a set of binary files in the ruby_packages/ directory.

	require 'rubygems'
	require 'ferret'

	PACKAGE_INDEX_DIR = 'ruby_packages/'
	Dir.mkdir(PACKAGE_INDEX_DIR) unless File.directory? PACKAGE_INDEX_DIR
	index = Ferret::Index::Index.new(:path => PACKAGE_INDEX_DIR,
	 :default_search_field => 'name|description')
	index << { :name => 'SimpleSearch',
	 :description => 'A simple indexing library.',
	 :supports_structured_data => false,
	 :complexity => 2 }
	index << { :name => 'Ferret',
	 :description => 'A Ruby port of the Lucene library.
	 More powerful than SimpleSearch',
	 :supports_structured_data => true,
	 :complexity => 5 }

By default, queries against this index will search the "name" and "description" fields, but you can search against any field:

	index.search_each('library') do |doc_id, score|
	 puts index.doc(doc_id).field('name').data
	end
	# SimpleSearch
	# Ferret

	index.search_each('description:powerful AND supports_structured_data:true') do
	|doc_id, score|
	 puts index.doc(doc_id).field("name").data
	end
	# Ferret

	index.search_each("complexity:<5") do |doc_id, score|
	 puts index.doc(doc_id).field("name").data
	end
	# SimpleSearch
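
Field-qualified queries like these are plain strings, so you can assemble them from data instead of typing them out. Here's a minimal sketch in plain Ruby. The field_query method is a hypothetical helper of my own, not part of Ferret, and it assumes the values contain no characters that are special in the query syntax.

```ruby
# Hypothetical helper (not part of Ferret): joins field/value pairs into
# a query string of the form "field:value AND field:value".
def field_query(conditions, operator = 'AND')
  conditions.map { |field, value| "#{field}:#{value}" }.join(" #{operator} ")
end

query = field_query({ :description => 'powerful',
                      :supports_structured_data => true })
puts query
# description:powerful AND supports_structured_data:true
```

The resulting string can be passed to search_each just like the handwritten queries above.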

Discussion

When should you use Ferret instead of SimpleSearch? SimpleSearch is good for unstructured data like plain text. Ferret excels at searching structured data, the kind you find in databases.

Relational databases are good at finding exact field matches, but not very good at locating keywords within large strings. Ferret works best when you need full text search but you want to keep some of the document structure. I've also had great success using Ferret^[6] to bring together data from disparate sources (some in databases, some not) into one structured, searchable index.

^[6] Actually, I was using Lucene. Same idea.

There are two things you can do with Ferret: add text to the index, and query the index. Ferret offers you a lot of control over both activities. I'll briefly cover the most interesting features.

You can feed an index by passing in a hash of field names to values, or you can feed it fully formed Ferret::Document objects. This gives you more control over which fields you'd like to index. Here, I'll create an index of news stories taken from a hypothetical database:

	# This include will cut down on the length of the Field:: constants below.
	include Ferret::Document

	def index_story(index, db_id, headline, story)
	 doc = Document.new
	 doc << Field.new("db_id", db_id, Field::Store::YES, Field::Index::NO)
	 doc << Field.new("headline", headline, Field::Store::YES, Field::Index::TOKENIZED)
	 doc << Field.new("story", story, Field::Store::NO, Field::Index::TOKENIZED)
	 index << doc
	end

	STORY_INDEX_DIR = 'news_stories/'
	Dir.mkdir(STORY_INDEX_DIR) unless File.directory? STORY_INDEX_DIR
	index = Ferret::Index::Index.new(:path => STORY_INDEX_DIR)

	index_story(index, 1, "Lizardoids Control the Media, Sources Say",
	 "Don't count on reading this story in your local paper anytime
	 soon, because …")

	index_story(index, 2, "Where Are My Pants? An Editorial",
	 "This is an outrage. The lizardoids have gone too far! …")

In this case, I'm storing the database ID in the Document, but I'm not indexing it. I don't want anyone to search on it, but I need some way of tying a Document in the index to a record in the database. That way, when someone does a search, I can print out the headline and provide a link to the original story.

I treat the body of the story exactly the opposite way: the words get indexed, but the original text is not stored and can't be recovered from the Document object. I'm not going to be displaying the text of the story along with my search results, and the text is already in the database, so why store it again in the index?
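
If the difference between storing a field and indexing it seems abstract, a toy model may help. The sketch below is plain Ruby and is not how Ferret is implemented. ToyIndex and its methods are hypothetical names I made up for illustration. Indexed fields feed a word-to-document lookup table, and stored fields are kept so they can be read back later.

```ruby
# Toy model of "stored" versus "indexed" fields. This is NOT Ferret's
# implementation; ToyIndex and its methods are hypothetical names.
class ToyIndex
  def initialize
    @postings = Hash.new { |hash, key| hash[key] = [] } # "field:word" => doc IDs
    @stored = []                                        # doc ID => stored fields
  end

  # Each field maps to { :value => ..., :store => bool, :index => bool }.
  def add(fields)
    doc_id = @stored.size
    @stored << {}
    fields.each do |name, opts|
      @stored[doc_id][name] = opts[:value] if opts[:store]
      next unless opts[:index]
      opts[:value].to_s.downcase.scan(/\w+/).uniq.each do |word|
        @postings["#{name}:#{word}"] << doc_id
      end
    end
    doc_id
  end

  # IDs of the documents whose indexed field contains the word.
  def search(field, word)
    @postings.fetch("#{field}:#{word.downcase}", [])
  end

  # Only stored fields can be read back.
  def doc(doc_id)
    @stored[doc_id]
  end
end

toy = ToyIndex.new
toy.add({ :db_id    => { :value => 2, :store => true, :index => false },
          :headline => { :value => 'Where Are My Pants? An Editorial',
                         :store => true, :index => true },
          :story    => { :value => 'This is an outrage.',
                         :store => false, :index => true } })

p toy.search(:story, 'outrage')  # [0]   the story text is searchable...
p toy.doc(0).key?(:story)        # false ...but it was never stored
p toy.search(:db_id, '2')        # []    db_id can't be searched...
p toy.doc(0)[:db_id]             # 2     ...but it can be read back
```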

The simplest way to search a Ferret index is with Index#search_each, as demonstrated in the Solution. This takes a query and a code block. For each document that matched the search query, it yields the document ID and a number between 0 and 1, representing the quality of the match.

You can get more information about the search results by calling search instead of search_each. This gives you a Ferret::Search::TopDocs object that contains the search results, as well as useful information like how many documents were matched. Call each on a TopDocs object and it'll act just as if you'd called search_each.

Here's some code that does a search and prints the results:

	def search_news(index, query)
	 results = index.search(query)
	 puts "#{results.size} article(s) matched:"

	 results.each do |doc_id, score|
	  story = index.doc(doc_id)
	  puts " #{story.field("headline").data} (score: #{score})"
	  puts " http://www.example.com/news/#{story.field("db_id").data}"
	  puts
	 end
	end

	search_news(index, "pants editorial")
	# 1 article(s) matched:
	# Where Are My Pants? An Editorial (score: 0.0908329636861293)
	# http://www.example.com/news/2

You can weight the fields differently to fine-tune the results. This query makes a match in the headline count twice as much as a match in the story:

	search_news(index, "headline:lizardoids^1 OR story:lizardoids^0.5")
	# 2 article(s) matched:
	# Lizardoids Control the Media, Sources Say (score: 0.195655948031232)
	# http://www.example.com/news/1
	#
	# Where Are My Pants? An Editorial (score: 0.0838525491562421)
	# http://www.example.com/news/2
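
To see why the boosts change the ranking, here's a deliberately simplified model in plain Ruby. It is not Ferret's scoring formula, which takes other factors into account (the scores above are clearly not simple counts). The FIELD_BOOSTS constant and the toy_score method are hypothetical names. The sketch assumes each occurrence of the term contributes its field's boost to the total.

```ruby
# Simplified illustration of field boosts. Real Lucene-style scoring
# involves more factors; all names here are hypothetical.
FIELD_BOOSTS = { :headline => 1.0, :story => 0.5 }

# Adds up (occurrences of term in field) * (boost for that field).
def toy_score(doc, term)
  FIELD_BOOSTS.inject(0.0) do |total, (field, boost)|
    hits = doc.fetch(field, '').downcase.scan(/\w+/).count(term.downcase)
    total + hits * boost
  end
end

headline_hit = { :headline => 'Lizardoids Control the Media, Sources Say',
                 :story    => 'Not in your local paper anytime soon' }
story_hit    = { :headline => 'Where Are My Pants? An Editorial',
                 :story    => 'The lizardoids have gone too far!' }

puts toy_score(headline_hit, 'lizardoids')  # 1.0
puts toy_score(story_hit, 'lizardoids')     # 0.5
```

Even in this crude model, a single headline match outranks a single story match by the ratio of the two boosts.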

Queries can be strings or Ferret::Search::Query objects. Pass in a string, and it just gets parsed and turned into a Query. The main advantage of creating your own Query objects is that you can put a user-friendly interface on your search functionality, instead of making people always construct Ferret queries by hand. The weighted_query method defined below takes a single keyword and creates a Query object equivalent to the rather complicated weighted query given above:

	def weighted_query(term)
	 query = Ferret::Search::BooleanQuery.new
	 query << term_clause("headline", term, 1)
	 query << term_clause("story", term, 0.5)
	end

	def term_clause(field, term, weight)
	 t = Ferret::Search::TermQuery.new(Ferret::Index::Term.new(field, term))
	 t.boost = weight
	 return Ferret::Search::BooleanClause.new(t)
	end
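
If you'd rather not compose Query objects at all, you can get much of the same user-friendliness by generating the query string. Here's a minimal sketch in plain Ruby. The weighted_query_string method is a hypothetical helper, not part of Ferret, and it assumes the keyword contains nothing that's special in the query syntax.

```ruby
# Hypothetical helper: builds the string form of a weighted query from a
# single keyword and a hash of field names to boost values.
def weighted_query_string(term, weights = { 'headline' => 1, 'story' => 0.5 })
  weights.map { |field, weight| "#{field}:#{term}^#{weight}" }.join(' OR ')
end

puts weighted_query_string('lizardoids')
# headline:lizardoids^1 OR story:lizardoids^0.5
```

The trade-off is that the keyword is interpolated as is, so any query operators a user types will be interpreted by the parser. Building Query objects avoids that.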

Ferret can be clumsy to use. It's got a lot of features to learn, and sometimes it seems like you spend all your time composing small objects into bigger objects (as in weighted_query above, which creates instances of four different classes). This is partly because Ferret is so flexible, and partly because the API comes mainly from Java. But nothing else works as well for searching structured text.
