The Groovy scripts are used as shown below. My example indexes
some free classic books from Project Gutenberg and then
searches inside them.
$ mkdir bookIndex
$ groovy -cp lucene-1.4.3.jar Indexer bookIndex ~/gutenberg
Indexing ~/gutenberg/Bram Stoker/Dracula.txt
Indexing ~/gutenberg/H. G. Wells/The War of the Worlds.txt
Indexing ~/gutenberg/Mark Twain/Adventures of Tom Sawyer.txt
Indexing ~/gutenberg/Oscar Wilde/The Picture of Dorian Gray.txt
Indexing 4 files took 2320 milliseconds
$ groovy -cp lucene-1.4.3.jar Searcher bookIndex indefatigable
Found 1 document(s) (in 30 milliseconds) that matched query 'indefatigable':
/Users/j6wbs/gutenberg/H. G. Wells/The War of the Worlds.txt
$
The first example is a script that will build an inverted index
from text files on your hard disc.
Usage: groovy -cp lucene-1.4.3.jar Indexer <index.dir> <text.files.dir>
Indexer.groovy
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document.Document
import org.apache.lucene.document.Field
import org.apache.lucene.index.IndexWriter

if (args.size() != 2) {
    throw new Exception(
        "Usage: groovy -cp lucene-1.4.3.jar Indexer <index dir> <data dir>")
}
def indexDir = new File(args[0])
def dataDir = new File(args[1])

def start = new Date().time
def numIndexed = index(indexDir, dataDir)
def end = new Date().time

println "Indexing $numIndexed files took ${end - start} milliseconds"

def index(indexDir, dataDir) {
    if (!dataDir.exists() || !dataDir.directory) {
        throw new IOException("$dataDir does not exist or is not a directory")
    }
    // true = create a new index, overwriting any existing one
    def writer = new IndexWriter(
        indexDir, new StandardAnalyzer(), true)
    writer.useCompoundFile = false

    dataDir.eachFileRecurse {
        // Escape the dot so that only names ending in ".txt" match
        if (it.name =~ /\.txt$/) {
            indexFile(writer, it)
        }
    }
    def numIndexed = writer.docCount()
    writer.optimize()
    writer.close()
    return numIndexed
}

void indexFile(writer, f) {
    if (f.hidden || !f.exists() || !f.canRead() || f.directory) { return }

    println "Indexing $f.canonicalPath"
    def doc = new Document()
    // "contents" is tokenized and indexed; "filename" is stored as-is
    doc.add(Field.Text("contents", new FileReader(f)))
    doc.add(Field.Keyword("filename", f.canonicalPath))
    writer.addDocument(doc)
}
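A note on the `.txt` filter: it is a regular-expression match on the file name, and in a regular expression an unescaped dot matches any character. Escaping it (`\.`) is what restricts the match to a literal `.txt` suffix. The sketch below shows the difference; the file names are made up for illustration and are not part of the book collection.

```groovy
// Hypothetical file names, chosen only to show the difference.
def names = ['Dracula.txt', 'notes_txt', 'README.md']

// Unescaped dot: "." matches any character, so "notes_txt" slips through.
def loose = names.findAll { it =~ /.txt$/ }

// Escaped dot: only names that really end in ".txt" match.
def strict = names.findAll { it =~ /\.txt$/ }

println "loose:  $loose"
println "strict: $strict"
```

With a directory that holds only `.txt` files the two patterns behave the same, which is why the looser form can go unnoticed.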
The second example builds upon the first by providing a command line tool to search the index of text files.
Usage: groovy -cp lucene-1.4.3.jar Searcher <index.dir> <your.query>
Searcher.groovy
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.queryParser.QueryParser
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.store.FSDirectory

if (args.size() != 2) {
    throw new Exception(
        "Usage: groovy -cp lucene-1.4.3.jar Searcher <index dir> <query>")
}
def indexDir = new File(args[0])
def q = args[1]

if (!indexDir.exists() || !indexDir.directory) {
    throw new Exception("$indexDir does not exist or is not a directory")
}

// false = open the existing index rather than creating a new one
def fsDir = FSDirectory.getDirectory(indexDir, false)
def is = new IndexSearcher(fsDir)

// Parse the query against the "contents" field written by the Indexer
def query = QueryParser.parse(q, "contents", new StandardAnalyzer())
def start = new Date().time
def hits = is.search(query)
def end = new Date().time

// print (not println) keeps the summary on one line, as in the sample output
print "Found ${hits.length()} document(s) "
println "(in ${end - start} milliseconds) that matched query '$q':"

for (i in 0 ..< hits.length()) {
    println(hits.doc(i)["filename"])
}
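As a rough mental model of what the two scripts delegate to Lucene: an inverted index maps each term to the set of documents that contain it, so a search becomes a map lookup rather than a scan of every file. The sketch below is a toy in plain Groovy and not Lucene's actual data structure. The document names and texts are invented, and tokenizing is reduced to lowercasing and splitting on non-letters.

```groovy
// Toy inverted index: term -> sorted set of document names.
// The documents here are invented sample data.
def docs = [
    'a.txt': 'The indefatigable Martians advanced',
    'b.txt': 'The count advanced slowly'
]

def index = [:].withDefault { new TreeSet() }
docs.each { name, text ->
    // Crude tokenizer: lowercase, then split on anything that is not a letter.
    text.toLowerCase().split(/[^a-z]+/).findAll { it }.each { term ->
        index[term] << name
    }
}

// Searching is a lookup of the lowercased query term.
// containsKey avoids creating an empty entry for unknown terms.
def search = { String q ->
    def term = q.toLowerCase()
    index.containsKey(term) ? index[term] as List : []
}

println search('indefatigable')
println search('Advanced')
println search('dracula')
```

Lowercasing both the indexed text and the query is what makes the lookup case-insensitive here; a real analyzer does considerably more than this.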
These scripts could be improved further by providing Groovy wrappers around common Lucene activities.
Such wrappers would let you pass the domain-specific work as a closure to a convenience method that takes care of the setup and cleanup.
Here is an idea of what it could look like (the following will not work... yet):
...
def index(indexDir, dataDir) {
    if (!dataDir.exists() || !dataDir.directory) {
        throw new IOException(
            "$dataDir does not exist or is not a directory")
    }
    def lucene = Lucene.newInstance()
    def numIndexed = lucene.write(indexDir) { writer ->
        dataDir.eachFileRecurse { file ->
            if (file.name =~ /\.txt$/) {
                indexFile(writer, file)
            }
        }
    }
    return numIndexed
}
...
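One way such a convenience method could be structured is the "execute around" idiom: the method owns the setup and cleanup, and the caller supplies only the work in a closure. The sketch below is an assumption about how that might look, not an existing API. `IndexHelper` and `FakeWriter` are hypothetical names, and `FakeWriter` stands in for Lucene's `IndexWriter` so that the example runs without the Lucene jar.

```groovy
// Stand-in for Lucene's IndexWriter so the sketch runs without the jar.
// Hypothetical: it has only the three methods the pattern needs.
class FakeWriter {
    def docs = []
    boolean closed = false
    void addDocument(doc) { docs << doc }
    int docCount() { docs.size() }
    void close() { closed = true }
}

// Hypothetical convenience class; no such wrapper ships with Groovy or Lucene.
class IndexHelper {
    // Hands the writer to the closure and always closes it afterwards.
    // Returns the number of documents written.
    static int write(FakeWriter writer, Closure work) {
        try {
            work(writer)
            return writer.docCount()
        } finally {
            writer.close()
        }
    }
}

def writer = new FakeWriter()
def numIndexed = IndexHelper.write(writer) { w ->
    ['Dracula.txt', 'The War of the Worlds.txt'].each { w.addDocument(it) }
}
println "Indexed $numIndexed documents, writer closed: $writer.closed"
```

The `finally` block is the point of the design: the writer is closed even if the closure throws, which the caller would otherwise have to remember to do.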
My thanks to Erik and Otis for allowing me to make their examples more Groovy.