Jeremy Rayner on java and other stuff.


Lucene and Groovy example
Posted on 22 Apr 2005
I've just got hold of a copy of the Lucene in Action book by Erik Hatcher and Otis Gospodnetic, and thought it would be fun to see what its examples of basic Lucene usage would look like in Groovy.

The Groovy scripts are used as follows. My example indexes, and then searches, some free classic books from Project Gutenberg.

$ mkdir bookIndex
$ groovy -cp lucene-1.4.3.jar Indexer bookIndex ~/gutenberg
Indexing ~/gutenberg/Bram Stoker/Dracula.txt
Indexing ~/gutenberg/H. G. Wells/The War of the Worlds.txt
Indexing ~/gutenberg/Mark Twain/Adventures of Tom Sawyer.txt
Indexing ~/gutenberg/Oscar Wilde/The Picture of Dorian Gray.txt
Indexing 4 files took 2320 milliseconds
$ groovy -cp lucene-1.4.3.jar Searcher bookIndex indefatigable
Found 1 document(s) (in 30 milliseconds) that matched query 'indefatigable':
/Users/j6wbs/gutenberg/H. G. Wells/The War of the Worlds.txt
$

The first example is a script that will build an inverted index from text files on your hard disc.

Usage: groovy -cp lucene-1.4.3.jar Indexer <index.dir> <text.files.dir>

Indexer.groovy (download)

import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document.Document
import org.apache.lucene.document.Field
import org.apache.lucene.index.IndexWriter

/**
 * Indexer: traverses a file system and indexes .txt files
 *
 * @author Jeremy Rayner <groovy@ross-rayner.com>
 * based on examples in the wonderful 'Lucene in Action' book
 * by Erik Hatcher and Otis Gospodnetic ( http://www.lucenebook.com )
 *
 * requires a lucene-1.x.x.jar from http://lucene.apache.org
 */

if (args.size() != 2 ) {
    throw new Exception(
        "Usage: groovy -cp lucene-1.4.3.jar Indexer <index dir> <data dir>")
}
def indexDir = new File(args[0]) // Create Lucene index in this directory
def dataDir = new File(args[1])  // Index files in this directory

def start = new Date().time
def numIndexed = index(indexDir, dataDir)
def end = new Date().time

println "Indexing $numIndexed files took ${end - start} milliseconds"

def index(indexDir, dataDir) {
    if (!dataDir.exists() || !dataDir.directory) {
        throw new IOException("$dataDir does not exist or is not a directory")
    }
    def writer = new IndexWriter(indexDir, new StandardAnalyzer(), true) // Create Lucene index
    writer.useCompoundFile = false

    dataDir.eachFileRecurse {
        if (it.name =~ /\.txt$/) { // Index .txt files only (dot escaped so it matches literally)
            indexFile(writer, it)
        }
    }
    def numIndexed = writer.docCount()
    writer.optimize()
    writer.close() // Close index
    return numIndexed
}

void indexFile(writer, f) {
    if (f.hidden || !f.exists() || !f.canRead() || f.directory) {
        return
    }

    println "Indexing $f.canonicalPath"
    def doc = new Document()

    // Construct a Field that is tokenized and indexed,
    // but is not stored in the index verbatim.
    doc.add(Field.Text("contents", new FileReader(f)))

    // Construct a Field that is not tokenized, but is indexed and stored.
    doc.add(Field.Keyword("filename", f.canonicalPath))

    writer.addDocument(doc) // Add document to Lucene index
}
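
If the term "inverted index" is new to you, the idea can be pictured with a few lines of plain Groovy: a map from each term to the set of documents that contain it. This is only a toy sketch to show the concept; it is not how Lucene stores its index, the tokenizer is deliberately crude, and the buildIndex name and the sample documents are made up for illustration.

```groovy
// Toy inverted index: maps each lower-cased term to the sorted set of
// document names containing it. Illustrative only, not Lucene's format.
def buildIndex(Map<String, String> docs) {
    def index = [:].withDefault { new TreeSet<String>() }
    docs.each { name, text ->
        // Crude tokenizer: lower-case, then split on anything that is
        // not a letter or digit, dropping empty tokens
        text.toLowerCase().split(/[^a-z0-9]+/).findAll { it }.each { term ->
            index[term] << name
        }
    }
    return index
}

// Hypothetical sample documents standing in for the Gutenberg text files
def docs = [
    'dracula.txt': 'The castle stood silent in the moonlight',
    'worlds.txt' : 'The indefatigable Martians worked in the pit'
]

def index = buildIndex(docs)
println index['indefatigable']   // prints [worlds.txt]
println index['the']             // prints [dracula.txt, worlds.txt]
```

Searching then becomes a lookup by term rather than a scan of every file, which is why the search in the session above comes back in a few milliseconds.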

The second example builds upon the first by providing a command line tool to search the index of text files.

Usage: groovy -cp lucene-1.4.3.jar Searcher <index.dir> <your.query>

Searcher.groovy (download)

import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.queryParser.QueryParser
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.store.FSDirectory

/**
 * Searcher: searches a Lucene index for a query passed as an argument
 *
 * @author Jeremy Rayner <groovy@ross-rayner.com>
 * based on examples in the wonderful 'Lucene in Action' book
 * by Erik Hatcher and Otis Gospodnetic ( http://www.lucenebook.com )
 *
 * requires a lucene-1.x.x.jar from http://lucene.apache.org
 */

if (args.size() != 2) {
    throw new Exception(
        "Usage: groovy -cp lucene-1.4.3.jar Searcher <index dir> <query>")
}
def indexDir = new File(args[0]) // Index directory created by Indexer
def q = args[1]                  // Query string

if (!indexDir.exists() || !indexDir.directory) {
    throw new Exception("$indexDir does not exist or is not a directory")
}

def fsDir = FSDirectory.getDirectory(indexDir, false)
def is = new IndexSearcher(fsDir) // Open index

def query = QueryParser.parse(q, "contents", new StandardAnalyzer()) // Parse query
def start = new Date().time
def hits = is.search(query) // Search index
def end = new Date().time

// print (not println) keeps the summary on one line, as in the sample output
print "Found ${hits.length()} document(s) "
println "(in ${end - start} milliseconds) that matched query '$q':"

for ( i in 0 ..< hits.length() ) {
    println(hits.doc(i)["filename"]) // Retrieve matching document and display filename
}

Further improvements to these scripts could be made by providing Groovy wrappers around common Lucene activities. This would let you supply the domain-specific work as a closure passed to a convenience method, e.g. lucene.write(dir) {...}

Here is an idea of what it could look like (the following will not work... yet):

...
def index(indexDir, dataDir) {
    if (!dataDir.exists() || !dataDir.directory) {
        throw new IOException(
          "$dataDir does not exist or is not a directory")
    }
    def lucene = Lucene.newInstance()
    def numIndexed = lucene.write(indexDir) {writer->
        dataDir.eachFileRecurse {file->
            if (file.name =~ /\.txt$/) { // Index .txt files only
                indexFile(writer,file)
            }
        }
    }
    return numIndexed
}
...
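
For what it's worth, here is a rough sketch of how such a write method might be put together. Everything in it is an assumption made for illustration: the Lucene class is hypothetical and is not part of the Lucene library, and StubWriter stands in for the real IndexWriter so the sketch runs without the jar. The shape is the usual "execute around" idiom: open the writer, hand it to the closure, then optimize and close.

```groovy
// Stand-in for Lucene's IndexWriter so the sketch runs without the Lucene jar.
// It mimics only the calls the Indexer script makes.
class StubWriter {
    List docs = []
    boolean optimized = false
    boolean closed = false
    void addDocument(doc) { docs << doc }
    int docCount() { docs.size() }
    void optimize() { optimized = true }
    void close() { closed = true }
}

// Hypothetical wrapper; the class name and write() signature are assumptions
class Lucene {
    // Factory closure that opens a writer for a directory. In real use it
    // might return new IndexWriter(dir, new StandardAnalyzer(), true).
    Closure openWriter = { dir -> new StubWriter() }

    static Lucene newInstance() { new Lucene() }

    // Opens a writer, passes it to the caller's closure, then optimizes;
    // returns the number of documents the writer holds.
    def write(dir, Closure work) {
        def writer = openWriter.call(dir)
        try {
            work.call(writer)
            def count = writer.docCount()
            writer.optimize()
            return count
        } finally {
            writer.close() // always close the writer, even if work throws
        }
    }
}

def lucene = Lucene.newInstance()
def numIndexed = lucene.write(new File('bookIndex')) { writer ->
    ['Dracula.txt', 'The War of the Worlds.txt'].each { writer.addDocument(it) }
}
println "Indexed $numIndexed documents" // prints: Indexed 2 documents
```

The main gain over the original script is that the caller can no longer forget to close the writer; the try/finally lives in one place.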

My thanks to Erik and Otis for allowing me to make their examples more Groovy.
