javanicus

Lucene and Groovy example

Posted on 22 Apr 2005

I've just got hold of a copy of the Lucene in Action book by Erik Hatcher and Otis Gospodnetic and thought it would be fun to see what the examples of basic Lucene usage would look like in Groovy.

The Groovy scripts are used as follows. My example indexes and then searches some free classic books from Project Gutenberg.

$ mkdir bookIndex
$ groovy -cp lucene-1.4.3.jar Indexer bookIndex ~/gutenberg
Indexing ~/gutenberg/Bram Stoker/Dracula.txt
Indexing ~/gutenberg/H. G. Wells/The War of the Worlds.txt
Indexing ~/gutenberg/Mark Twain/Adventures of Tom Sawyer.txt
Indexing ~/gutenberg/Oscar Wilde/The Picture of Dorian Gray.txt
Indexing 4 files took 2320 milliseconds
$ groovy -cp lucene-1.4.3.jar Searcher bookIndex indefatigable
Found 1 document(s) (in 30 milliseconds) that matched query 'indefatigable':
/Users/j6wbs/gutenberg/H. G. Wells/The War of the Worlds.txt
$

The first example is a script that will build an inverted index from text files on your hard disc.
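
For readers new to the term, an inverted index maps each term to the documents that contain it, which is what makes the later search fast. The following is only a toy sketch in plain Groovy to illustrate the idea; the file names and texts are made up, the tokenization is deliberately naive, and it is not how Lucene stores its index internally.

```groovy
// Toy inverted index: term -> sorted set of document names containing it.
// The documents below are invented purely for illustration.
def docs = [
    'a.txt': 'the quick brown fox',
    'b.txt': 'the lazy dog',
    'c.txt': 'quick quick dog'
]

def invertedIndex = new TreeMap<String, Set<String>>()
docs.each { name, text ->
    // Naive tokenization: lower-case, then split on whitespace
    text.toLowerCase().split(/\s+/).each { term ->
        if (!invertedIndex.containsKey(term)) {
            invertedIndex[term] = new TreeSet<String>()
        }
        invertedIndex[term] << name  // A set, so repeated terms add the name once
    }
}

// Print every term with the documents it occurs in
invertedIndex.each { term, names -> println "$term -> $names" }
```

Looking up a word is then a single map access: invertedIndex['quick'] holds a.txt and c.txt, without scanning any of the texts again.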

Usage: groovy -cp lucene-1.4.3.jar Indexer <index.dir> <text.files.dir>

Indexer.groovy

import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document.Document
import org.apache.lucene.document.Field
import org.apache.lucene.index.IndexWriter

/**
 * Indexer: traverses a file system and indexes .txt files
 *
 * @author Jeremy Rayner <groovy@ross-rayner.com>
 * based on examples in the wonderful 'Lucene in Action' book
 * by Erik Hatcher and Otis Gospodnetic ( http://www.lucenebook.com )
 *
 * requires a lucene-1.x.x.jar from http://lucene.apache.org
 */
if (args.size() != 2 ) {
    throw new Exception(
    "Usage: groovy -cp lucene-1.4.3.jar Indexer <index dir> <data dir>")
}
def indexDir = new File(args[0]) // Create Lucene index in this directory
def dataDir = new File(args[1])  // Index files in this directory
def start = new Date().time
def numIndexed = index(indexDir, dataDir)
def end = new Date().time
println "Indexing $numIndexed files took ${end - start} milliseconds"
def index(indexDir, dataDir) {
    if (!dataDir.exists() || !dataDir.directory) {
        throw new IOException("$dataDir does not exist or is not a directory")
    }
    def writer = new IndexWriter(
        indexDir, new StandardAnalyzer(), true)  // Create Lucene index
    writer.useCompoundFile = false
    dataDir.eachFileRecurse {
        if (it.name =~ /\.txt$/) {  // Index .txt files only (dot escaped so it matches literally)
            indexFile(writer,it)
        }
    }
    def numIndexed = writer.docCount()
    writer.optimize()
    writer.close()  // Close index
    return numIndexed
}
void indexFile(writer, f) {
    if (f.hidden || !f.exists() || !f.canRead() || f.directory) { return }
    println "Indexing $f.canonicalPath"
    def doc = new Document()
    // Construct a Field that is tokenized and indexed,
    // but is not stored in the index verbatim.
    doc.add(Field.Text("contents", new FileReader(f)))
    // Construct a Field that is not tokenized, but is indexed and stored.
    doc.add(Field.Keyword("filename",f.canonicalPath))
    writer.addDocument(doc)  // Add document to Lucene index
}

The second example builds upon the first by providing a command line tool to search the index of text files.

Usage: groovy -cp lucene-1.4.3.jar Searcher <index.dir> <your.query>

Searcher.groovy

import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.queryParser.QueryParser
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.store.FSDirectory

/**
 * Searcher: searches a Lucene index for a query passed as an argument
 *
 * @author Jeremy Rayner <groovy@ross-rayner.com>
 * based on examples in the wonderful 'Lucene in Action' book
 * by Erik Hatcher and Otis Gospodnetic ( http://www.lucenebook.com )
 *
 * requires a lucene-1.x.x.jar from http://lucene.apache.org
 */
if (args.size() != 2) {
    throw new Exception(
    "Usage: groovy -cp lucene-1.4.3.jar Searcher <index dir> <query>")
}
def indexDir = new File(args[0])  // Index directory created by Indexer
def q = args[1]  // Query string
if (!indexDir.exists() || !indexDir.directory) {
    throw new Exception("$indexDir does not exist or is not a directory")
}
def fsDir = FSDirectory.getDirectory(indexDir, false)
def is = new IndexSearcher(fsDir)  // Open index
def query = QueryParser.parse(q, "contents", new StandardAnalyzer())  // Parse query
def start = new Date().time
def hits = is.search(query)  // Search index
def end = new Date().time
print "Found ${hits.length()} document(s) "  // print, not println, so the summary stays on one line
println "(in ${end - start} milliseconds) that matched query '$q':"
for ( i in 0 ..< hits.length() ) {
    println(hits.doc(i)["filename"])  // Retrieve matching document and display filename
}

These scripts could be improved in the future by providing Groovy wrappers around common Lucene activities. Convenience methods such as lucene.write(dir) {...} would let you supply the domain-specific work inside a closure. Here is an idea of what it could look like (the following will not work... yet):

...
def index(indexDir, dataDir) {
    if (!dataDir.exists() || !dataDir.directory) {
        throw new IOException(
          "$dataDir does not exist or is not a directory")
    }
    def lucene = Lucene.newInstance()
    def numIndexed = lucene.write(indexDir) {writer->
        dataDir.eachFileRecurse {file->
            if (file.name =~ /\.txt$/) { // Index .txt files only
                indexFile(writer,file)
            }
        }
    }
    return numIndexed
}
...
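
As a rough idea of how such a wrapper might be written, here is a minimal sketch. The Lucene class, its newInstance() factory and its write(indexDir) {...} method are all hypothetical names taken from the wish above; nothing like them ships with Lucene or Groovy. The sketch assumes lucene-1.4.3.jar is on the classpath and uses only the IndexWriter calls already seen in Indexer.groovy.

```groovy
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.index.IndexWriter

// Hypothetical convenience class; not part of Lucene or Groovy.
class Lucene {
    static Lucene newInstance() { new Lucene() }

    // Creates a fresh index in indexDir, passes the writer to the closure,
    // then optimizes the index. Returns the number of documents indexed.
    int write(File indexDir, Closure work) {
        def writer = new IndexWriter(indexDir, new StandardAnalyzer(), true)
        try {
            work(writer)  // Caller's domain-specific indexing happens here
            def numIndexed = writer.docCount()
            writer.optimize()
            return numIndexed
        } finally {
            writer.close()  // Close the index even if the closure throws
        }
    }
}
```

The point of the design is that opening, optimizing and closing the index live in one place, so the calling script only describes which documents to add.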

My thanks to Erik and Otis for allowing me to make their examples more Groovy.
