You want to index a number of texts and do quick keyword searches on them.
Use the SimpleSearch library, available in the SimpleSearch gem.
Here's how to create and save an index:
require 'rubygems'
require 'search/simple'

contents = Search::Simple::Contents.new
contents << Search::Simple::Content.new('In the beginning God created the heavens…',
                                        'Genesis.txt', Time.now)
contents << Search::Simple::Content.new('Call me Ishmael…',
                                        'MobyDick.txt', Time.now)
contents << Search::Simple::Content.new('Marley was dead to begin with…',
                                        'AChristmasCarol.txt', Time.now)

searcher = Search::Simple::Searcher.load(contents, 'index_file')
Here's how to load and search an existing index:
require 'rubygems'
require 'search/simple'

searcher = nil
open('index_file') do |f|
  searcher = Search::Simple::Searcher.new(Marshal.load(f), Marshal.load(f),
                                          'index_file')
end

searcher.find_words(['begin']).results.collect { |result| result.name }
# => ["AChristmasCarol.txt", "Genesis.txt"]
SimpleSearch is a library that makes it easy to do fast keyword searching on unstructured text documents. The index itself is represented by a Searcher object, and each document you feed it is a Content object.
To create an index, you must first construct a number of Content objects and a Contents object to contain them. A Content object contains a piece of text, a unique identifier for that text (often a filename, though it could also be a database ID or a URL), and the time at which the text was last modified. Searcher.load transforms a Contents object into a searchable index that gets serialized to disk with Marshal.
The indexer analyzes the text you give it, removes stop words (like "a"), truncates words to their roots (so "beginning" becomes "begin"), and puts every word of the text into binary data structures. Given a set of words to find and a set of words to exclude, SimpleSearch uses these structures to quickly find a set of documents.
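To make those steps concrete, here is a minimal pure-Ruby sketch of the general technique: drop stop words, reduce words to a root, and map each root to the set of documents containing it. It is not SimpleSearch's actual implementation. The stop word list, the suffix-stripping rule in toy_stem, and the helper names terms, build_index, and toy_find are all assumptions made for illustration.

```ruby
require 'set'

# Hypothetical stop word list; a real indexer ships a much longer one.
STOP_WORDS = Set.new(%w[a an and in is me the to was with]).freeze

# Crude suffix stripping that stands in for a real stemming algorithm.
def toy_stem(word)
  word.sub(/(ning|ing|s)\z/, '')
end

# Lowercase the text, split it into words, drop stop words, stem the rest.
def terms(text)
  text.downcase.scan(/[a-z]+/)
      .reject { |word| STOP_WORDS.include?(word) }
      .map { |word| toy_stem(word) }
end

# Build an inverted index: each stemmed term maps to a set of document names.
def build_index(docs)
  index = Hash.new { |hash, term| hash[term] = Set.new }
  docs.each do |name, text|
    terms(text).each { |term| index[term] << name }
  end
  index
end

# Return documents containing every included word and none of the excluded ones.
def toy_find(index, include_words, exclude_words = [])
  lookup = ->(word) { index.fetch(toy_stem(word.downcase), Set.new) }
  hits = include_words.map(&lookup).reduce(:&) || Set.new
  exclude_words.each { |word| hits -= lookup.call(word) }
  hits.to_a.sort
end

DOCS = {
  'Genesis.txt'         => 'In the beginning God created the heavens',
  'MobyDick.txt'        => 'Call me Ishmael',
  'AChristmasCarol.txt' => 'Marley was dead to begin with'
}

index = build_index(DOCS)
toy_find(index, ['begin'])            # => ["AChristmasCarol.txt", "Genesis.txt"]
toy_find(index, ['begin'], ['dead'])  # => ["Genesis.txt"]
```

Because "beginning" and "begin" reduce to the same root, a search for either word finds both documents, and the exclusion list is applied as a set difference on the matches.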
Here's how to add some new documents to an existing index:
class Search::Simple::Searcher
  def add_contents(contents)
    Search::Simple::Searcher.create_indices(contents, @dict,
                                            @document_vectors)
    dump  # Re-serialize the file
  end
end

contents = Search::Simple::Contents.new
contents << Search::Simple::Content.new('A spectre is haunting Europe…',
                                        'TheCommunistManifesto.txt', Time.now)
searcher.add_contents(contents)

searcher.find_words(['spectre']).results[0].name
# => "TheCommunistManifesto.txt"
SimpleSearch doesn't support incremental indexing. If you update or delete a document, you must recreate the entire index from scratch.