Indexing and Searching Text Files using Lucene


Lucene is an open-source full-text search library written in Java, which makes it easy to add search functionality to an application or website.

In this post, I give a quick overview of how easily text files can be indexed with Lucene; searching the index is covered in the next post. The code example uses Lucene 4.2.1.

We will specify one folder that holds the text files to be indexed by Lucene, and another folder that stores the index files (an index can also be kept in other kinds of storage, such as database tables, which we will cover in another post).
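Before indexing, it can help to confirm which files Lucene will actually be given. The following is a minimal sketch using only the JDK, not part of the indexer itself; the class name ListFilesToIndex and the default folder name "docs" are assumptions made for illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ListFilesToIndex {

    // Recursively collects the path of every readable regular file under 'root'.
    public static List<String> collect(File root) {
        List<String> paths = new ArrayList<String>();
        walk(root, paths);
        Collections.sort(paths); // directory listing order is unspecified
        return paths;
    }

    private static void walk(File file, List<String> out) {
        if (!file.canRead()) {
            return; // skip anything we are not allowed to read
        }
        if (file.isDirectory()) {
            File[] children = file.listFiles(); // null if an I/O error occurs
            if (children != null) {
                for (File child : children) {
                    walk(child, out);
                }
            }
        } else {
            out.add(file.getPath());
        }
    }

    public static void main(String[] args) {
        // Hypothetical default; pass your own docs folder as the first argument.
        File docs = new File(args.length > 0 ? args[0] : "docs");
        for (String path : collect(docs)) {
            System.out.println(path);
        }
    }
}
```

Running it against the docs folder prints one path per line, which is the same set of files the recursive indexing method visits.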

Let’s take a quick look at the code that indexes the files. I have added comments to explain the snippet; feel free to ask in the comments section if anything is unclear.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
// In Lucene 4.x, StandardAnalyzer ships in the lucene-analyzers-common jar
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexFiles {
    public static final String FILES_TO_INDEX_DIRECTORY = "/Users/jay/Downloads/lucene/docs/";
    public static final String INDEX_DIRECTORY = "/Users/jay/Downloads/lucene/index/";
    // true: drop existing index and create new one
    // false: add new documents to existing index
    public static final boolean CREATE_INDEX = true;

    public static void index() {
        final File docDir = new File(FILES_TO_INDEX_DIRECTORY);
        if (!docDir.exists() || !docDir.canRead()) {
            System.out.println("Document directory '"
                    + docDir.getAbsolutePath()
                    + "' does not exist or is not readable, please check the path");
            System.exit(1);
        }

        Date start = new Date();
        try {
            System.out.println("Indexing to directory '" + INDEX_DIRECTORY + "'...");

            Directory dir = FSDirectory.open(new File(INDEX_DIRECTORY));
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_42, analyzer);

            if (CREATE_INDEX) {
                // Create a new index in the directory, removing any
                // previously indexed documents:
                iwc.setOpenMode(OpenMode.CREATE);
            } else {
                // Add new documents to an existing index:
                iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
            }

            IndexWriter writer = new IndexWriter(dir, iwc);
            try {
                indexDocs(writer, docDir);
            } finally {
                // always close the writer so the index lock is released,
                // even if indexing fails part-way through
                writer.close();
            }

            Date end = new Date();
            System.out.println(end.getTime() - start.getTime() + " total milliseconds");

        } catch (IOException e) {
            System.out.println(" caught a " + e.getClass()
                    + "\n with message: " + e.getMessage());
        }
    }

    // Indexes a single file, or recurses into a directory and indexes its contents
    static void indexDocs(IndexWriter writer, File file) throws IOException {
        if (file.canRead()) {
            if (file.isDirectory()) {
                String[] files = file.list();
                // an IO error could occur
                if (files != null) {
                    for (int i = 0; i < files.length; i++) {
                        indexDocs(writer, new File(file, files[i]));
                    }
                }
            } else {
                FileInputStream fis;
                try {
                    fis = new FileInputStream(file);
                } catch (FileNotFoundException fnfe) {
                    // at least on windows, some temporary files raise this
                    // exception with an "access denied" message
                    // checking if the file can be read doesn't help
                    return;
                }

                try {
                    // make a new, empty document
                    Document doc = new Document();

                    // path: indexed as a single token and stored
                    Field pathField = new StringField("path", file.getPath(),
                            Field.Store.YES);
                    doc.add(pathField);
                    // filename: tokenized and stored
                    doc.add(new TextField("filename", file.getName(), Field.Store.YES));

                    // modified: indexed as a number, not stored
                    doc.add(new LongField("modified", file.lastModified(),
                            Field.Store.NO));

                    // contents: tokenized from the reader, not stored
                    doc.add(new TextField("contents", new BufferedReader(
                            new InputStreamReader(fis, "UTF-8"))));

                    if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
                        // New index, so we just add the document (no old
                        // document can be there):
                        System.out.println("adding " + file);
                        writer.addDocument(doc);
                    } else {
                        // Existing index (an old copy of this document may have
                        // been indexed), so we use updateDocument instead to
                        // replace the old one matching the exact path, if present:
                        System.out.println("updating " + file);
                        writer.updateDocument(new Term("path", file.getPath()), doc);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                } finally {
                    fis.close();
                }
            }
        }
    }

    /**
     * Entry point: indexes everything under FILES_TO_INDEX_DIRECTORY.
     *
     * @param args not used
     */
    public static void main(String[] args) {
        index();
    }

}

As you can see, the code reads the text files from the specified directory one by one and adds the contents of each file to the Lucene index. This example indexes the following fields:

  1. “path”: the file path, defined as a StringField (indexed as a single token, without analysis)
  2. “filename”: the name of the text file, defined as a TextField
  3. “modified”: the last-modified time of the file in milliseconds, defined as a LongField
  4. “contents”: the actual contents of the text file, defined as a TextField

Notice that we add these fields to a Document object, which is simply a set of fields and is the unit of indexing and search. Each field has a name and a value, which is usually text but can also be numeric, as with the “modified” field.

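A field may be stored in the document, in which case its value is returned with search hits on that document. Fields that are not stored can still be searched, but their values are not available in documents retrieved from the index. In this example, “path” and “filename” are stored, while “modified” and “contents” are indexed only.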
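The difference between stored and unstored fields can be seen by reading a document back from the index. This is a minimal sketch against the Lucene 4.2.1 API, assuming the lucene-core and lucene-analyzers-common jars are on the classpath; the class name StoredFieldDemo and the sample values are made up for illustration, and an in-memory RAMDirectory is used so that nothing is written to disk.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class StoredFieldDemo {

    // Indexes one document and returns { path, modified, contents } as read back.
    public static String[] readBack() throws IOException {
        Directory dir = new RAMDirectory(); // in-memory index for the demo
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_42,
                new StandardAnalyzer(Version.LUCENE_42));
        IndexWriter writer = new IndexWriter(dir, iwc);
        try {
            Document doc = new Document();
            doc.add(new StringField("path", "/tmp/sample.txt", Field.Store.YES));
            doc.add(new LongField("modified", 0L, Field.Store.NO));
            doc.add(new TextField("contents", "hello lucene", Field.Store.NO));
            writer.addDocument(doc);
        } finally {
            writer.close(); // commits the document and releases the lock
        }

        DirectoryReader reader = DirectoryReader.open(dir);
        try {
            Document fromIndex = reader.document(0); // only stored fields come back
            return new String[] {
                fromIndex.get("path"),
                fromIndex.get("modified"),
                fromIndex.get("contents")
            };
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        String[] values = readBack();
        System.out.println("path:     " + values[0]); // the stored value
        System.out.println("modified: " + values[1]); // null, not stored
        System.out.println("contents: " + values[2]); // null, not stored
    }
}
```

Only “path” comes back with a value; the other two fields are null in the retrieved document even though they were indexed.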

We have used the following classes to perform the simplest indexing procedure:

  1. IndexWriter: creates a new index or adds documents to an existing index
  2. Directory: represents the location of the index. We use the subclass FSDirectory, which keeps the index on disk (another is RAMDirectory, which stores the index in RAM)
  3. Analyzer: extracts tokens from the text to be indexed and discards the rest of the content (with StandardAnalyzer, punctuation and common stop words)
  4. Document: a collection of fields
  5. Field: a piece of data that is either queried against or retrieved from the index during search
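To make the Analyzer's role more concrete, here is a toy sketch in plain Java of what "extracting tokens" means. It is not Lucene's implementation: the class ToyAnalyzer and its short stop-word list are assumptions made for illustration, and a real analyzer handles many more cases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ToyAnalyzer {

    // Tiny illustrative stop-word list; not the list Lucene uses.
    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    // Lowercases the text, splits it on anything that is not a letter
    // or digit, and drops empty pieces and stop words.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The quick brown fox is an Animal."));
        // prints [quick, brown, fox, animal]
    }
}
```

The tokens that survive are what end up in the index, which is why a search for "the" or for punctuation would find nothing under this toy scheme.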

Once the documents are indexed, we must close the IndexWriter object using

writer.close();

otherwise, an attempt to open another writer on the same index path can fail with a LockObtainFailedException, because the index is still locked for writing.
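The effect of a writer that is never closed can be modelled without Lucene. The following is a toy sketch, assuming a hypothetical ToyWriter that creates a lock file when opened and deletes it when closed; Lucene's real lock handling is more involved. Since IndexWriter implements Closeable, the same try-with-resources pattern can be applied to it on Java 7 or later.

```java
import java.io.Closeable;
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ToyLockDemo {

    // Hypothetical stand-in for a writer that guards a directory with a lock file.
    public static class ToyWriter implements Closeable {
        private final Path lock;

        public ToyWriter(Path indexDir) throws IOException {
            lock = indexDir.resolve("toy.lock");
            try {
                Files.createFile(lock); // atomic: fails if the file already exists
            } catch (FileAlreadyExistsException e) {
                throw new IOException("index is locked: " + lock);
            }
        }

        @Override
        public void close() throws IOException {
            Files.deleteIfExists(lock); // releasing the lock lets the next writer in
        }
    }

    // Returns true if a new writer can be opened on indexDir right now.
    public static boolean canOpen(Path indexDir) {
        try (ToyWriter writer = new ToyWriter(indexDir)) {
            return true; // the writer is closed automatically on the way out
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("toy-index");

        ToyWriter leaked = new ToyWriter(dir); // opened but not yet closed
        System.out.println(canOpen(dir));      // false: the lock is still held
        leaked.close();
        System.out.println(canOpen(dir));      // true: the lock was released
    }
}
```

The try-with-resources block in canOpen closes the writer on every exit path, which is the behaviour you want from the real IndexWriter as well.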

In my next post, we will see how this index can be used for searching contents.
