Not So Nitty Gritty

Reading

You build a converter for each type of data element or file you want to search. Your converter constructs a Document object, full of Fields. Each Field has a name and a value. Your converter runs the value data through an Analyzer, which cleans up the data (for example, making it all lower case and dropping "stop words" like "and", "it", "the", etc.) before adding it to a Document. You can build your own Analyzer, or use one of the two or three Analyzers that come with Lucene. You also ought to include some important Fields, like the modification date of the file, the file name, etc. (more on this later, when we get to displaying search results). You should also seriously consider glomming all of the Field data together and storing it as some sort of "all" Field, so your users can search all Fields at once if they want.
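To make the converter idea concrete, here is a minimal sketch in plain Java. It is a toy stand-in and deliberately does not use Lucene's API: the class name ToyConverter, the stop-word list, and the representation of a document as a map from field name to tokens are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy stand-in for the converter step; this is NOT Lucene's API.
class ToyConverter {
    // Hypothetical stop-word list, chosen only for illustration.
    static final Set<String> STOP_WORDS = Set.of("and", "it", "the", "a", "of");

    // "Analyze" a value: lower-case it, split on non-alphanumerics, drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Build a "document": each field name maps to its analyzed tokens,
    // and a combined "all" field collects the tokens of every field.
    static Map<String, List<String>> convert(Map<String, String> rawFields) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        List<String> all = new ArrayList<>();
        for (Map.Entry<String, String> field : rawFields.entrySet()) {
            List<String> tokens = analyze(field.getValue());
            doc.put(field.getKey(), tokens);
            all.addAll(tokens);
        }
        doc.put("all", all);
        return doc;
    }
}
```

A query against the "all" field of such a document then matches a term no matter which individual field it came from.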

Spidering

You build code to assemble a list of the data you want to be able to search. One common strategy is to build code to spider your website, building a list of all of the pages of your website. You run each chunk of data (each page, or each entry in your database catalog, or whatever) through your converter, and pump it into an IndexWriter.

Specifically, you open the IndexWriter with a directory name, which is where the IndexWriter will write the index files it creates. Then you loop through your list of documents (or crawl through your document repository, or iterate through your list of products, etc.), run each one through the converter, and pump the resulting Document into the IndexWriter. When you're done, you close the IndexWriter.
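The open, loop, close cycle described above can be sketched with a toy in-memory index. This is not Lucene's IndexWriter: the class ToyIndexWriter, its methods, and the indexAll helper are hypothetical names, and the sketch keeps everything in memory instead of writing index files to a directory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy in-memory stand-in for the open / add / close cycle; NOT Lucene's IndexWriter.
class ToyIndexWriter implements AutoCloseable {
    // Inverted index: term -> ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    private final List<String> storedNames = new ArrayList<>();
    private boolean closed = false;

    // Add one document and return the id assigned to it.
    int addDocument(String name, String text) {
        if (closed) {
            throw new IllegalStateException("writer is closed");
        }
        int id = storedNames.size();
        storedNames.add(name);
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    Set<Integer> docsContaining(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Set.of());
    }

    String nameOf(int id) {
        return storedNames.get(id);
    }

    @Override
    public void close() {
        closed = true;
    }

    // Hypothetical helper: loop over (name, text) pairs, add each, then close.
    static ToyIndexWriter indexAll(Map<String, String> pages) {
        ToyIndexWriter writer = new ToyIndexWriter();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            writer.addDocument(page.getKey(), page.getValue());
        }
        writer.close();
        return writer;
    }
}
```

The map passed to indexAll plays the role of the list your spider assembled: one entry per page, keyed by whatever name you will later need to link back to it.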

Queries

You build code to present a form to the user, to allow them to enter their query. Your servlet or Swing application takes the query string the user entered and runs it through the QueryParser to get a Query object. There's a tricky bit here: you have to make sure you use an Analyzer with the QueryParser, and you have to make sure it's the same sort of Analyzer that you used when you created the search index. That way the query string gets cleaned up the same way your Documents got cleaned up.
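A small plain-Java sketch shows why the Analyzer has to match on both sides. The two "analyzers" here are ordinary functions invented for this example (one lower-cases, one does not); they are not Lucene classes, and the index is just a set of terms.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Toy demonstration of an analyzer mismatch; NOT Lucene's Analyzer or QueryParser.
class AnalyzerMismatchDemo {
    // Hypothetical analyzer that lower-cases and splits on whitespace.
    static final Function<String, List<String>> LOWERCASING =
            text -> Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));

    // Hypothetical analyzer that only splits on whitespace, keeping case.
    static final Function<String, List<String>> CASE_PRESERVING =
            text -> Arrays.asList(text.trim().split("\\s+"));

    // "Index" a document: keep the set of terms its analyzer produced.
    static Set<String> index(String documentText, Function<String, List<String>> analyzer) {
        return new HashSet<>(analyzer.apply(documentText));
    }

    // A query matches only if every analyzed query term is in the index.
    static boolean matches(Set<String> indexedTerms, String query,
                           Function<String, List<String>> analyzer) {
        return indexedTerms.containsAll(analyzer.apply(query));
    }
}
```

If a document is indexed with LOWERCASING, the query "Lucene" matches when it is analyzed the same way, but misses when it is analyzed with CASE_PRESERVING, because the index holds "lucene" and the query asks for "Lucene".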

Searching

Once you have your Query (which can really be a whole tree of Query objects, since a complex query can be broken down into multiple subqueries) you feed it into an IndexSearcher. If you're using a long-running application, like a servlet, you should probably instantiate the IndexSearcher ahead of time (with, of course, the name of the directory where the search index files are kept), and keep it around in between requests, to avoid the overhead of reinstantiating it all the time. The IndexSearcher returns a Hits object, which contains a Vector of Lucene Documents.
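One way to keep a searcher around between requests is a small holder that opens it on first use and hands back the same instance afterwards. The sketch below is generic plain Java, not Lucene code: SharedSearcher and its opener argument are hypothetical, and in a real application the opener would be whatever code constructs your IndexSearcher.

```java
import java.util.function.Supplier;

// Toy holder that creates an expensive object once and then reuses it.
// The type parameter S stands in for the searcher type; this is NOT Lucene's API.
class SharedSearcher<S> {
    private final Supplier<S> opener; // hypothetical: code that opens the searcher
    private S instance;               // null until the first request arrives

    SharedSearcher(Supplier<S> opener) {
        this.opener = opener;
    }

    // synchronized so concurrent requests cannot open the searcher twice.
    synchronized S get() {
        if (instance == null) {
            instance = opener.get();
        }
        return instance;
    }
}
```

A servlet could keep one such holder in a field and call get() from every request, so the cost of opening the searcher is paid once.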

Displaying

Now you loop through the Hits, displaying each one to the user. Remember the file name? This is where you use it. If the original is static information, you display a link to the file. If it's dynamically generated information, you either make sure the Document includes the Fields you need to regenerate it, or you store all the necessary information in the Document Fields themselves, so you can reconstitute what you need from the Document.
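As a rough sketch of the display step, the following plain-Java helper turns the stored fields of one hit into a line of HTML. It does not use Lucene's Hits or Document classes: a hit is modeled as a simple map, and the field names "filename", "title", and "summary" are assumptions for this example.

```java
import java.util.Map;

// Toy renderer for one search hit; NOT Lucene's API.
class HitRenderer {
    // Minimal HTML escaping so stored text cannot break the page markup.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    // Static content: link to the original file.
    // Otherwise: reconstitute a display line from the stored fields.
    static String render(Map<String, String> storedFields) {
        String title = escape(storedFields.getOrDefault("title", "(untitled)"));
        String fileName = storedFields.get("filename");
        if (fileName != null) {
            return "<a href=\"" + escape(fileName) + "\">" + title + "</a>";
        }
        return title + ": " + escape(storedFields.getOrDefault("summary", ""));
    }
}
```

Looping over your hits and calling render on each one's stored fields produces the result list.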
