Organizing XML Data Using Lucene

The Jakarta Lucene project is a text search engine that can perform full-text searches on various types of documents. Even though Jakarta Lucene is not part of the Commons project, it deserves notice because it is unusual and interesting. Within the Apache XML group is another project called Xindice, which is an XML database. That project will not be covered in this book. Had this been a pure XML development book, we would be discussing Xindice instead of Jakarta Lucene. The purpose of this book, however, is to show how you can easily create a document-processing framework using Java programming techniques.

The Jakarta Lucene text-search engine is unique because it combines text search with SQL-like abilities, which let you define search attributes precisely. This is an interesting combination because part of the strength of SQL is its ability to find specific pieces of data. Search engines are good at indexing any type of information but have traditionally had problems finding the "needle in the haystack." In the betwixt section, we converted Java classes into XML files. Those XML files are used as the basis of the index in the Jakarta Lucene program.

Technical Details for the Jakarta Lucene Program

Tables 5.5 and 5.6 contain the abbreviated details necessary to use the Lucene program.

Table 5.5: Repository details for the Jakarta Lucene program.
Item	Details
CVS repository	jakarta-lucene
Directory within repository	Not applicable
Main packages used	org.apache.lucene.index.*, org.apache.lucene.analysis.*, org.apache.lucene.document.*, org.apache.lucene.search.*, org.apache.lucene.queryParser.*

Table 5.6: Package and class details (legend: [lucene] = org.apache.lucene).
Class/Interface	Details
[lucene].index.IndexWriter	A class used to manage, write, and update an index. This class is not thread safe, so access to it must be synchronized.
[lucene].analysis.Analyzer	An abstract base class that defines a word analyzer and tokenizer, which reads text and breaks it into its individual pieces.
[lucene].analysis.standard.StandardAnalyzer	A standard implementation of Analyzer. It makes searches case insensitive and removes common words (stop words).
[lucene].document.Document	A class used to store the various fields used for indexing data that is referenced as a document.
[lucene].document.Field	A class used to store an individual key value pair that represents a field that is part of a document.
[lucene].search.Query	A class used to store a query that will be applied against an index. The query is independent from an index and hence could be stored for future reference.
[lucene].search.Searcher	An abstract class used to describe a generic searchable class instance.
[lucene].search.Hits	A class that contains the results from running a query on a Searcher class instance.
[lucene].queryParser.QueryParser	A factory class used to instantiate a Query class instance.

Details on Building the Jakarta Lucene Program

Building the Jakarta Lucene program from the sources is not complicated. The only complication is that you need to download the JavaCC.zip package, which contains JavaCC (the Java Compiler Compiler), a parser generator. Then you need to define the correct configuration in the default.properties file. Specifically, you need to set the properties shown in Listing 5.48.

Listing 5.48

 # Home directory of JavaCC
 javacc.home = /home/cgross/jars
 javacc.zip.dir = ${javacc.home}/lib
 javacc.zip = ${javacc.zip.dir}/JavaCC.zip

In Listing 5.48, the javacc.home directory is set to the /home/cgross/jars directory. However, when the JavaCC package is executed, the JavaCC.zip file is searched for in the directory /home/cgross/jars/lib. For those of you who are Windows programmers, note that this JavaCC.zip file is not the same as the downloaded JavaCC.zip file, which contains the JavaCC.jar file and the associated Javadoc documentation files. You need to create the required JavaCC.zip file yourself by unpacking the JavaCC.jar file and packing its contents into a new JavaCC.zip file.

Once a compile has been successfully executed using Ant, you will have two jar files: lucene-*.jar and lucene-demos-*.jar. You need to add these two files to the class path. Add them to the root class path rather than to a configuration file; doing so makes testing and development much simpler.

Indexing a Document

When you use the Lucene database system, data is indexed and a user can query for it, much like with a search engine. In a traditional relational database, when a piece of data is added to a table, indexing most likely occurs automatically. The data and its associated index are stored transparently to the end user. In the Lucene database, indexing is not transparent and requires some careful consideration because of the nature of the data managed by Lucene. That data is diverse, and therefore each document has its own characteristics that determine how it should be indexed and queried.

When you index a document using Lucene, you need to define a document and consider what is important in it. The important parts of the document are the parts that are indexed. To build some continuity from the betwixt package section, we will use the sample serialized bean XML files as the basis for building the documents. We will start with the simplest bean, the class BeanToWrite shown in Listing 5.13. Listing 5.49 is an example of how the class BeanToWrite can be indexed.

Listing 5.49

 public void indexBeanToWrite( boolean isNewIndex, String location, BeanToWrite bean) {
     try {
         // Serialize the bean to an XML file at the given location.
         saveBean( location, bean);
         Analyzer analyzer = new StandardAnalyzer();
         // isNewIndex == true creates a new index; false adds to an existing one.
         IndexWriter writer = new IndexWriter( _indexDir, analyzer, isNewIndex);
         Document document = new Document();
         // Stored but not indexed: reference data needed to reload the bean.
         document.add( Field.UnIndexed( "path", location));
         document.add( Field.UnIndexed( "class", bean.getClass().getName()));
         // Stored and indexed: the value that can be searched on.
         document.add( Field.Text( "integerValue", String.valueOf( bean.getIntegerValue())));
         writer.addDocument(document);
         // Closing the writer writes the index to disk.
         writer.close();
     } catch( Exception ex) {
         System.out.println( "****************");
         System.out.println( ex.getMessage());
         ex.printStackTrace();
     }
 }

In Listing 5.49, the method indexBeanToWrite has three parameters. The first parameter, isNewIndex, is used to either create a new index or add to an already existing one. The second parameter, location, is used to define the location where the serialized bean will be saved. The third and last parameter, bean, is the bean that will be serialized and indexed.

After the method indexBeanToWrite has been called, the method saveBean is called, which serializes the Java Bean to a file using the betwixt serialization routines, similar to the examples that we have seen in the previous sections of this chapter. As far as the Lucene database is concerned, however, you don't need to take this step of serializing to a file because the Lucene database does not tie in to the betwixt package. The data could have been generated using an XML editing tool, and the Lucene database would not have cared either way.

When a piece of information is indexed in Lucene, three parts make the index work. The first part is the definition of an analyzer, which in Listing 5.49 is the class StandardAnalyzer. The analyzer is used to decipher the text that is to be indexed. For example, the standard analyzer tokenizes the text, removes generic words, and converts the remaining words to lowercase to enable case-insensitive searches. The purpose of the tokenizer is to split the text into single words that can be indexed. Removing the generic words (more commonly called stop words) involves getting rid of words like "a," "this," and "or" because these words tend to return a large number of irrelevant results when you do a search. Several analyzers are available in the org.apache.lucene.analysis package.
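To make the three analysis steps concrete, here is a minimal sketch that uses only the standard Java library. It is not the StandardAnalyzer implementation: the class name ToyAnalyzer, the method analyze, and the stop-word list are all assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for an analyzer; NOT Lucene's StandardAnalyzer.
final class ToyAnalyzer {
    // Hypothetical stop-word list, chosen only for this illustration.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "is", "or", "the", "this"));

    // Splits text into tokens, lowercases them, and drops stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Tokenize: split on any run of characters that is not a letter or digit.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "is" is dropped as a stop word; the remaining words are lowercased.
        System.out.println(analyze("My document is here")); // [my, document, here]
    }
}
```

Only the tokens that survive these steps would reach the index, which is why a search for a stop word in this sketch could never return a hit.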

The second part of indexing a document is to create a writer, which in Listing 5.49 is the class IndexWriter. The class IndexWriter is responsible for writing and managing the index.

The last part of indexing a document is to create a document, which in Listing 5.49 is the class Document. The purpose of the class Document is to define a document that is to be indexed. You define a document by creating a number of key-value pairs, each of which defines a field that is added to the document using the method add. In Listing 5.49, there are three different fields for the index. We will provide the exact specifics of the fields in a later section ("Defining the Different Field Types").

Once all three parts have been defined, the method writer.addDocument is called to add the document to the index. To write the index, the method writer.close is called, and the index is organized and written to the hard disk.

Querying for a Document

Once an index has been created and written, it can be queried. Querying an index is shown in Listing 5.50.

Listing 5.50

 Searcher searcher = new IndexSearcher( _indexDir);
 Query query = QueryParser.parse( String.valueOf( value), "integerValue", new StandardAnalyzer());
 Hits hits = searcher.search(query);
 Vector vector = new Vector();
 for (int i = 0; i < hits.length(); i++) {
     System.out.println( "Class identifier " + hits.doc(i).get("class"));
     vector.add( readBean( hits.doc(i).get("path"), BeanToWrite.class));
 }

In Listing 5.50, querying an index is simpler than creating or adding to an index and requires only two parts. The first part, performed by the class IndexSearcher, is used to query an index. In Listing 5.50, the IndexSearcher constructor takes a single argument, which is the directory where the index is located. The Lucene database can work either with directories or with in-memory indices. However, to keep it simple, we will use the directory-based index.

The second part, which is defined by the class Query, is the creation of the query. To create a class instance of the class Query, the static method QueryParser.parse is called. In Listing 5.50, the method is given three pieces of information: the query string, the default field to search, and the analyzer used to interpret the query. It is very important that you use the same analyzer in the query as you did to create the index. Using different analyzers will result in interpretation errors. For example, consider what happens if one analyzer is case sensitive and the other is not.

Once the two parts have been created, they are combined using the method searcher.search, which returns a class instance of Hits. The class Hits represents the results of the query and contains a result set of class type Document. In a relational environment, this would be a standard result set; in the Lucene database, however, each result has an associated score. The score of a document indicates how closely it matches the query. In Listing 5.50, the result set of documents is then used to read the beans written in the indexing step.
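The case-sensitivity example can be sketched with a toy inverted index built from the standard Java library. Nothing here is Lucene code: the class AnalyzerMismatchDemo and the two normalizers LOWERCASE and AS_IS are hypothetical names that stand in for a case-insensitive and a case-sensitive analyzer.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

// Toy inverted index that shows why index and query analysis must agree.
final class AnalyzerMismatchDemo {
    // Hypothetical "analyzers": one lowercases each word, one leaves it untouched.
    static final UnaryOperator<String> LOWERCASE = s -> s.toLowerCase(Locale.ROOT);
    static final UnaryOperator<String> AS_IS = UnaryOperator.identity();

    // Maps each indexed term to the ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final UnaryOperator<String> indexNormalizer;

    AnalyzerMismatchDemo(UnaryOperator<String> indexNormalizer) {
        this.indexNormalizer = indexNormalizer;
    }

    // Indexes every whitespace-separated word after normalizing it.
    void add(int docId, String text) {
        for (String word : text.trim().split("\\s+")) {
            postings.computeIfAbsent(indexNormalizer.apply(word), k -> new TreeSet<>()).add(docId);
        }
    }

    // Looks up a single term after applying the query-side normalizer.
    Set<Integer> search(String term, UnaryOperator<String> queryNormalizer) {
        return postings.getOrDefault(queryNormalizer.apply(term), new TreeSet<>());
    }

    public static void main(String[] args) {
        AnalyzerMismatchDemo index = new AnalyzerMismatchDemo(LOWERCASE);
        index.add(1, "Hello Dolly");
        System.out.println(index.search("Hello", LOWERCASE)); // [1]
        System.out.println(index.search("Hello", AS_IS));     // []
    }
}
```

The index holds only the term hello, so a query that is not lowercased in the same way looks for Hello and finds nothing, even though the document plainly contains the word.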

Defining a Database Strategy

We deliberately "forgot" to mention why things were done the way they were in Listings 5.49 and 5.50. The reason is that to fully comprehend how to save and index XML documents, or to understand the Lucene database strategy, you first need the mechanics of Lucene explained. The following sections explain how to interact with the Lucene database and what is and is not possible.

In Listing 5.49, three fields were saved to the Lucene database. The first field was defined by the code Field.UnIndexed( "path", location). In this case, the purpose of the field is to define the location of the serialized XML file. The XML file could be stored within the field, or it could be stored within a file or even within a SQL table. The idea is to define a unique reference from which the content of the serialized bean can be retrieved. In Internet terms, consider it a Uniform Resource Identifier (URI) or URL.

The second field was defined by the code Field.UnIndexed( "class", bean.getClass().getName()). In this case, the purpose of the field is to define the class name. Knowing the class name, the calling platform could dynamically load the class and then register the class descriptor with the betwixt platform.

The third and last field was defined by the code Field.Text( "integerValue", String.valueOf( bean.getIntegerValue())). The purpose of this field is to provide a field that is indexed and can be searched on.

The three fields each define a unique purpose that should be reflected in any business application. The first and second fields will have values specific to the technology used and will typically not be indexed. The last field changes from application to application. That type of field might not just be a single field but multiple fields.

When you create an index, like in Listing 5.49, you use a specific directory. In a relational database, the management of the indices is transparent to the end programmer. The Lucene database is partially transparent with its indices. While there is no requirement to individually manage the index, not all indices should be stored in one directory. Each individual index should reflect a business task. However, knowing how to split the various indices is not an exact science and should be experimented with.

Defining the Different Field Types

When you define a field, like in Listing 5.49, there are different field types to choose from. In Listing 5.49, two fields were added using the method Field.UnIndexed, and the other was added using the method Field.Text. Each method (UnIndexed and Text) creates a field with specific characteristics. In the case of the method Text, the field is added to the index and the original text is stored in the database. This means that when a document that contains the field is retrieved, you can retrieve the original content.

As an example, consider the text "my document is here". Lucene will split up and index the text. The words my, document, and here will be added to the index. Adding the words to the index does not by itself add the original block as a field that can be retrieved. If you want to be able to retrieve the text as a block, you must use a field type that explicitly stores it.

The following other field types can be manipulated by the Lucene database:

Field.Keyword: This is a special field because the text block is not tokenized. This field is very useful for dealing with complex strings like dates, serial numbers, or titles. The purpose of this field is to index content that is unique as a whole and should not be tokenized. For example, if the field value were "hello dolly", the field would not be split into two tokens but would be kept as one token. If the value were split into two, it would be easier to search for the individual tokens. Using this type of field, the entire content of the field can be retrieved, which makes it identical in storage terms to a SQL database column.
Field.UnIndexed: This type of field is used to store blocks of text, but these blocks are not indexed. The purpose of these fields is to provide a quick reference to the actual data, so that you don't have to reference it elsewhere. In the case of Listing 5.49, this field type is used to store reference information.
Field.UnStored: This type of field is used to add information to the index, but the information is not stored for extraction.
Field.Text: This type of field, which is used in many cases, is stored and tokenized in the database. This method has an overloaded version, where the data is not supplied as a string but as a class instance of Reader. Using the class Reader, you can add large text fields to an index. The only catch is that the contents held by the class Reader are not stored, so the field acts more like a Field.UnStored type.

Each of these methods returns a class instance of Field. In addition, you can tweak the boost property of the class Field. You can use this property to influence search results: a boosted field carries more weight in the score that is generated for every query result.
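The field types described above differ only in whether the value is stored, indexed, and tokenized. The following enum is our own summary of those descriptions, not part of the Lucene API; the name FieldKind and its constants are made up for this sketch.

```java
// Toy summary of the field types above; this is not Lucene's Field class.
enum FieldKind {
    KEYWORD(true, true, false),           // Field.Keyword
    UNINDEXED(true, false, false),        // Field.UnIndexed
    UNSTORED(false, true, true),          // Field.UnStored
    TEXT_FROM_STRING(true, true, true),   // Field.Text with a String value
    TEXT_FROM_READER(false, true, true);  // Field.Text with a Reader value

    final boolean stored;     // can the original value be retrieved from a hit?
    final boolean indexed;    // can the field be searched on?
    final boolean tokenized;  // is the value split into individual words?

    FieldKind(boolean stored, boolean indexed, boolean tokenized) {
        this.stored = stored;
        this.indexed = indexed;
        this.tokenized = tokenized;
    }

    public static void main(String[] args) {
        // Print one line per field type with its three flags.
        for (FieldKind kind : values()) {
            System.out.println(kind + ": stored=" + kind.stored
                + ", indexed=" + kind.indexed + ", tokenized=" + kind.tokenized);
        }
    }
}
```

Read this way, the choice of field type in Listing 5.49 follows directly: path and class only need to be retrieved, so they are stored without being indexed, while integerValue must be searchable.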

Querying Using a Query String

In a relational database, you create a query by using the SQL programming language. With Lucene, instead of SQL, you use a querying language that's not related to SQL. It is more closely related to an Internet search engine that can search in specific fields. Listing 5.51 shows how to query using a query string, which is very similar to querying using a specific field.

Listing 5.51

 public Vector queryStringBeanToWrite( String strQuery) {
     try {
         Searcher searcher = new IndexSearcher( _indexDir);
         QueryParser queryParser = new QueryParser( "integerValue", new StandardAnalyzer());
         queryParser.setOperator( QueryParser.DEFAULT_OPERATOR_OR);
         Query query = queryParser.parse(strQuery);
         Hits hits = searcher.search(query);
         Vector vector = new Vector();
         for (int i = 0; i < hits.length(); i++) {
             System.out.println( "Class identifier " + hits.doc(i).get("class"));
             vector.add( readBean( hits.doc(i).get("path"), BeanToWrite.class));
         }
         return vector;
     } catch( Exception ex) {
         System.out.println( "****************");
         System.out.println( ex.getMessage());
         ex.printStackTrace();
     }
     return null;
 }

In Listing 5.51, instead of using the static class method QueryParser.parse, we instantiated the class QueryParser and then called the method parse. We did this because the analyzer and the default operator need to be associated with an instance of QueryParser; the analyzer is associated when the class QueryParser is instantiated. The class Query is instantiated using the method queryParser.parse, where the variable strQuery contains a query.

An initial query that could be used to call the method queryStringBeanToWrite is shown in Listing 5.52.

Listing 5.52

 integerValue:1234 OR integerValue:2345

The query in Listing 5.52 finds the documents where the field integerValue has the value of 1234 or where the field integerValue has the value of 2345. In Lucene, a colon separates the field name and associated value. Another way to write the query in Listing 5.52 would be Listing 5.53.

Listing 5.53

 integerValue:1234 integerValue:2345

In Listing 5.53, an OR keyword is missing. This is OK because in Listing 5.51, the method call queryParser.setOperator sets the default operator to be an OR to join multiple statements.

Yet another way to write the query in Listing 5.52 is Listing 5.54.

Listing 5.54

 1234 2345

In Listing 5.54, there is no OR keyword, nor is there any field descriptor. The query still executes and returns the same results as Listing 5.52 because we set the default field in Listing 5.51. When the class QueryParser was instantiated, the field integerValue was given as the default field identifier.

There is even one more way to write Listing 5.52. This is shown in Listing 5.55.

Listing 5.55

 integerValue:(1234 2345)

Listing 5.55 uses parentheses to group the items into a list that asks for documents where the field integerValue has the value 1234 or 2345. The default operator between the two numbers is the OR operator, which could have been explicitly written out.

Having seen all these variations of the same query, we now ask whether or not it is advisable to rely on default field identifiers and operators. The answer is mostly no; reliance on default identifiers or operators may not be a good idea. It is better to explicitly name each field and operator. This leads to less confusion when somebody is trying to maintain already existing code or fix bugs.
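To see what the defaults do, consider the following sketch, which rewrites a simple query so that every field and operator is spelled out. It is not how QueryParser works internally; the class QueryDefaultsDemo and the method expand are hypothetical, and the sketch handles only space-separated terms, not parentheses or quoted phrases.

```java
// Toy expansion of default fields and operators; NOT Lucene's QueryParser.
final class QueryDefaultsDemo {
    // Rewrites a query so that every term names its field and every pair of
    // adjacent terms is joined by an explicit operator.
    static String expand(String query, String defaultField, String defaultOperator) {
        StringBuilder out = new StringBuilder();
        boolean previousWasTerm = false;
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty query produces no tokens
            }
            if (token.equals("OR") || token.equals("AND")) {
                // An explicit operator is copied through unchanged.
                out.append(' ').append(token);
                previousWasTerm = false;
                continue;
            }
            if (previousWasTerm) {
                // Two terms in a row: insert the default operator between them.
                out.append(' ').append(defaultOperator);
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            // A term without a colon gets the default field prepended.
            out.append(token.contains(":") ? token : defaultField + ":" + token);
            previousWasTerm = true;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Listings 5.53 and 5.54 both expand to the query in Listing 5.52.
        System.out.println(expand("integerValue:1234 integerValue:2345", "integerValue", "OR"));
        System.out.println(expand("1234 2345", "integerValue", "OR"));
    }
}
```

Both calls print integerValue:1234 OR integerValue:2345. Writing the query in that explicit form from the start means that its meaning does not change if somebody later changes the default field or operator.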

Up to this point, the example queries used the OR keyword. Here is a list of all operators you can use when writing queries:

OR : In a result set, this combines the result sets from multiple queries. It is as if the queries separated by the OR are combined to build one result set.
AND : In a result set, this ensures that every item of the result set contains the requested data. Writing integerValue:1234 AND integerValue:2345 would be futile because the field integerValue cannot be both numbers at the same time. The AND operator is typically used in the context of a multiple field query.
NOT : In a result set, this ensures that the result set does not contain a specific value. For example, the query integerValue:1234 NOT stringValue:"hello world" says to select all the documents where the field integerValue has a value of 1234 but where the field stringValue does not have a value of "hello world".
+ : In a result set, this ensures that every document has a specific value. For example, the query +integerValue:1234 OR stringValue:"hello world" says to select all of the documents whose field integerValue has a value of 1234 and whose field stringValue may have a value of "hello world".
- : This is like the + operator except that instead of requiring each document to contain the value, this operator says that the result set should not contain the value.
[ ] : This defines a group of values that has a range. For example, the query integerValue:[1000 3000] says that all documents with an integerValue in the range of 1,000 to 3,000 should be included in the result set.
~ : This defines a proximity search. For example, the query stringValue:"hello world"~10 selects all documents whose field stringValue contains the words hello and world within 10 words of each other.
^ : This defines a boost to a specific term, resulting in a higher ordering in the result set. An example could be integerValue:1234^4.
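The difference between the +, -, and unmarked clauses in the list above can be sketched as a small matching rule. The code below is our own simplification and not Lucene code: the class ClauseDemo is hypothetical, a document is modeled as a map of field names to values, and a clause matches only if the whole field value is equal.

```java
import java.util.HashMap;
import java.util.Map;

// Toy evaluation of required (+), prohibited (-), and optional clauses.
final class ClauseDemo {
    // Each clause has the form "field:value", optionally prefixed with + or -.
    static boolean matches(Map<String, String> doc, String... clauses) {
        boolean hasRequired = false;
        boolean anyOptionalMatched = false;
        for (String clause : clauses) {
            char prefix = clause.charAt(0);
            String body = (prefix == '+' || prefix == '-') ? clause.substring(1) : clause;
            int colon = body.indexOf(':');
            String field = body.substring(0, colon);
            String value = body.substring(colon + 1);
            boolean hit = value.equals(doc.get(field));
            if (prefix == '+') {
                hasRequired = true;
                if (!hit) {
                    return false; // a required clause must match
                }
            } else if (prefix == '-') {
                if (hit) {
                    return false; // a prohibited clause must not match
                }
            } else if (hit) {
                anyOptionalMatched = true;
            }
        }
        // Without any required clause, at least one optional clause must match.
        return hasRequired || anyOptionalMatched;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("integerValue", "1234");
        doc.put("stringValue", "hello world");
        System.out.println(matches(doc, "+integerValue:1234", "stringValue:other"));        // true
        System.out.println(matches(doc, "+integerValue:1234", "-stringValue:hello world")); // false
    }
}
```

In the first call the optional clause does not match, but the document is still selected because the required clause does. In the second call the prohibited clause matches, so the document is rejected.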

The following operators can be applied to the words themselves:

Hello~ : Performs a fuzzy search for a specific word
Hel?o : Performs a search where the question mark represents a single wildcard character
He* or He*o : Performs a search where the asterisk stands for any sequence of zero or more characters
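The two wildcard characters can be illustrated by translating them into a regular expression. This is only a sketch of the matching rule; Lucene does not necessarily implement wildcards this way, and the class WildcardDemo is a name made up for the example.

```java
import java.util.regex.Pattern;

// Toy wildcard matcher: '?' is exactly one character, '*' is zero or more.
final class WildcardDemo {
    // Translates a wildcard expression into an equivalent regular expression.
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                // Quote every other character so that it matches literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Returns true if the whole term matches the wildcard expression.
    static boolean matches(String wildcard, String term) {
        return toPattern(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("Hel?o", "Hello")); // true
        System.out.println(matches("Hel?o", "Helo"));  // false, ? needs one character
        System.out.println(matches("He*o", "Hero"));   // true
    }
}
```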