Indexing Unstructured Text with SimpleSearch

Problem

You want to index a number of texts and do quick keyword searches on them.

Solution

Use the SimpleSearch library, available in the SimpleSearch gem.

Here's how to create and save an index:

	require 'rubygems'
	require 'search/simple'

	contents = Search::Simple::Contents.new
	contents << Search::Simple::Content.
	              new('In the beginning God created the heavens…',
	                  'Genesis.txt', Time.now)
	contents << Search::Simple::Content.new('Call me Ishmael…',
	                                        'MobyDick.txt', Time.now)
	contents << Search::Simple::Content.new('Marley was dead to begin with…',
	                                        'AChristmasCarol.txt', Time.now)

	searcher = Search::Simple::Searcher.load(contents, 'index_file')

Here's how to load and search an existing index:

	require 'rubygems'
	require 'search/simple'

	searcher = nil
	open('index_file') do |f|
	  searcher = Search::Simple::Searcher.new(Marshal.load(f), Marshal.load(f),
	                                          'index_file')
	end

	searcher.find_words(['begin']).results.collect { |result| result.name }
	# => ["AChristmasCarol.txt", "Genesis.txt"]

Discussion

SimpleSearch is a library that makes it easy to do fast keyword searching on unstructured text documents. The index itself is represented by a Searcher object, and each document you feed it is a Content object.

To create an index, you must first construct a number of Content objects and a Contents object to contain them. A Content object contains a piece of text, a unique identifier for that text (often a filename, though it could also be a database ID or a URL), and the time at which the text was last modified. Searcher.load transforms a Contents object into a searchable index that gets serialized to disk with Marshal.
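
The loading code in the Solution calls Marshal.load twice on the same file handle. That works because Marshal can write several objects back to back into one stream and read them out again in the same order. The following is a minimal sketch of that round trip using only the standard library. The two hashes are made-up stand-ins (the real index file presumably holds SimpleSearch's dictionary and document vectors, whose actual structure is not shown here).

```ruby
require 'tempfile'

# Hypothetical stand-ins for the two objects stored in an index file.
# SimpleSearch's real structures differ; these only illustrate the mechanism.
dictionary = { 'begin' => 0, 'ishmael' => 1 }
vectors    = { 'Genesis.txt' => [0], 'MobyDick.txt' => [1] }

file = Tempfile.new('index_file')
file.binmode
# Write the two objects one after the other into the same file.
Marshal.dump(dictionary, file)
Marshal.dump(vectors, file)
file.close

loaded_dictionary = loaded_vectors = nil
File.open(file.path, 'rb') do |f|
  # Each Marshal.load call consumes exactly one object from the stream,
  # so the objects come back in the order they were written.
  loaded_dictionary = Marshal.load(f)
  loaded_vectors    = Marshal.load(f)
end
file.unlink

loaded_dictionary == dictionary  # => true
loaded_vectors == vectors        # => true
```

Because the order of the loads must match the order of the dumps, code that reads such a file by hand is tied to the layout the library happened to write.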

The indexer analyzes the text you give it, removes stop words (like "a"), truncates words to their roots (so "beginning" becomes "begin"), and puts every word of the text into binary data structures. Given a set of words to find and a set of words to exclude, SimpleSearch uses these structures to quickly find a set of documents.
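
To make those steps concrete, here is a toy inverted index written from scratch. It is not SimpleSearch's implementation: the ToyIndex class, its stop word list, and its crude suffix-stripping stemmer are all invented for illustration, and it stores plain sets of document names rather than the binary structures the library uses. It only shows the general idea of stop word removal, stemming, and include/exclude lookups.

```ruby
require 'set'

# A toy index (hypothetical, not part of SimpleSearch).
class ToyIndex
  # A tiny, made-up stop word list.
  STOP_WORDS = Set.new(%w[a an the in to of with me was is])

  def initialize
    # Maps each stemmed term to the set of document names containing it.
    @postings = Hash.new { |hash, term| hash[term] = Set.new }
  end

  def add(name, text)
    words = text.downcase.scan(/[a-z]+/)
    words.reject { |word| STOP_WORDS.include?(word) }.each do |word|
      @postings[stem(word)] << name
    end
  end

  # Names of documents containing every word in +wanted+ and none of the
  # words in +unwanted+, sorted alphabetically.
  def find(wanted, unwanted = [])
    return [] if wanted.empty?
    hits = wanted.map { |word| lookup(word) }.reduce(:&)
    unwanted.each { |word| hits -= lookup(word) }
    hits.to_a.sort
  end

  private

  # Very crude stemmer: strips a trailing "ing" or "s", then collapses a
  # doubled final consonant ("beginn" becomes "begin").
  def stem(word)
    stripped = word.sub(/(ing|s)\z/, '')
    stripped = word if stripped.length < 3
    stripped.sub(/([^aeiou])\1\z/, '\1')
  end

  def lookup(word)
    # fetch with an explicit default avoids creating empty entries.
    @postings.fetch(stem(word.downcase), Set.new)
  end
end

index = ToyIndex.new
index.add('Genesis.txt', 'In the beginning God created the heavens')
index.add('MobyDick.txt', 'Call me Ishmael')
index.add('AChristmasCarol.txt', 'Marley was dead to begin with')

index.find(['begin'])              # => ["AChristmasCarol.txt", "Genesis.txt"]
index.find(['begin'], ['marley'])  # => ["Genesis.txt"]
```

Stemming both the indexed text and the query is what lets a search for "begin" match a document that only contains "beginning".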

Here's how to add some new documents to an existing index:

	class Search::Simple::Searcher
	  def add_contents(contents)
	    Search::Simple::Searcher.create_indices(contents, @dict,
	                                            @document_vectors)
	    dump # Re-serialize the file
	  end
	end

	contents = Search::Simple::Contents.new
	contents << Search::Simple::Content.new('A spectre is haunting Europe…',
	                                        'TheCommunistManifesto.txt', Time.now)
	searcher.add_contents(contents)
	searcher.find_words(['spectre']).results[0].name
	# => "TheCommunistManifesto.txt"

SimpleSearch doesn't support incremental indexing. If you update or delete a document, you must recreate the entire index from scratch.
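
Since a rebuild needs every document again, it can help to keep the source texts in one place and regenerate the inputs from there. The sketch below assumes the documents live as .txt files in a single directory. The collect_documents helper is hypothetical, and it only gathers the three values the Solution passes to Search::Simple::Content.new (text, name, modification time). The gem calls themselves appear only in comments.

```ruby
require 'tmpdir'

# Hypothetical helper: returns [text, name, mtime] triples for every .txt
# file in +dir+, sorted by path.
def collect_documents(dir)
  Dir.glob(File.join(dir, '*.txt')).sort.map do |path|
    [File.read(path), File.basename(path), File.mtime(path)]
  end
end

# Demonstration against a throwaway directory.
names = Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'MobyDick.txt'), 'Call me Ishmael')
  File.write(File.join(dir, 'Genesis.txt'), 'In the beginning')

  documents = collect_documents(dir)

  # With the gem loaded, the rebuild would then follow the Solution:
  #   contents = Search::Simple::Contents.new
  #   documents.each { |doc| contents << Search::Simple::Content.new(*doc) }
  #   searcher = Search::Simple::Searcher.load(contents, 'index_file')

  documents.map { |_text, name, _mtime| name }
end

names  # => ["Genesis.txt", "MobyDick.txt"]
```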

See Also

  • The SimpleSearch home page (http://www.chadfowler.com/SimpleSearch/)
  • The sample application within the SimpleSearch gem: search-simple.rb
  • Recipe 13.2, "Serializing Data with Marshal"
  • For a more sophisticated indexer, see Recipe 13.5, "Indexing Structured Text with Ferret"


Ruby Cookbook (O'Reilly Cookbooks)
ISBN: 0596523696