Monday, June 4, 2012

Apache Lucene Tutorial: Lucene for Text Search

Overview:
This article looks at how Apache Lucene can be used to perform text based searching. Lucene provides a high-performance text based search capabilities. This is a very easy to use library.

This article applies to Lucene 3.6 (latest release at the time of writing).

The sample code can be found here.

You may also refer to,
Apache Lucene Tutorial: Indexing PDF Files
Apache Lucene Tutorial: Indexing Microsoft Documents

Project Structure:

 org.fazlan.lucene.demo  
 |-- pom.xml  
 `-- src  
   `-- main  
     |-- java  
     |  `-- org  
     |    `-- fazlan  
     |      `-- lucene  
     |        `-- demo  
     |          |-- Indexer.java  
     |          |-- IndexItem.java  
     |          |-- Main.java  
     |          `-- Searcher.java  
     `-- resources  
       `-- index  

Step 1: Creating a Maven Project

 mvn archetype:generate -DartifactId=org.fazlan.lucene.demo -DgroupId=org.fazlan -Dversion=1.0-SNAPSHOT -DinteractiveMode=false  

Step 2: Updated Maven Dependency (pom.xml)

 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">  
  <modelVersion>4.0.0</modelVersion>  
  <groupId>org.fazlan</groupId>  
  <artifactId>org.fazlan.lucene.demo</artifactId>  
  <packaging>jar</packaging>  
  <version>1.0-SNAPSHOT</version>  
  <name>org.fazlan.lucene.demo</name>  
  <url>http://maven.apache.org</url>  
  <dependencies>  
   <dependency>  
      <groupId>org.apache.lucene</groupId>  
      <artifactId>lucene-core</artifactId>  
      <version>3.6.0</version>  
   </dependency>  
   <dependency>  
    <groupId>junit</groupId>  
    <artifactId>junit</artifactId>  
    <version>3.8.1</version>  
    <scope>test</scope>  
   </dependency>  
  </dependencies>  
 </project>  

Step 3: Defining the POJO Class used to Index Items


 package org.fazlan.lucene.demo;  
 public class IndexItem {  
   private Long id;  
   private String title;  
   private String content;  
   public static final String ID = "id";  
   public static final String TITLE = "title";  
   public static final String CONTENT = "content";  
   public IndexItem(Long id, String title, String content) {  
     this.id = id;  
     this.title = title;  
     this.content = content;  
   }  
   public Long getId() {  
     return id;  
   }  
   public String getTitle() {  
     return title;  
   }  
   public String getContent() {  
     return content;  
   }  
   @Override  
   public String toString() {  
     return "IndexItem{" +  
         "id=" + id +  
         ", title='" + title + '\'' +  
         ", content='" + content + '\'' +  
         '}';  
   }  
 }  

Step 4: Defining the Indexer

 package org.fazlan.lucene.demo;  
 import org.apache.lucene.analysis.standard.StandardAnalyzer;  
 import org.apache.lucene.document.Document;  
 import org.apache.lucene.document.Field;  
 import org.apache.lucene.index.IndexWriter;  
 import org.apache.lucene.index.IndexWriterConfig;  
 import org.apache.lucene.index.Term;  
 import org.apache.lucene.store.FSDirectory;  
 import org.apache.lucene.util.Version;  
 import java.io.File;  
 import java.io.IOException;  
 public class Indexer  
 {  
   private IndexWriter writer;  
   public Indexer(String indexDir) throws IOException {  
     // create the index  
     if(writer == null) {  
         writer = new IndexWriter(FSDirectory.open(  
           new File(indexDir)), new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));  
     }  
   }  
   /**   
    * This method will add the items into index  
    */  
   public void index(IndexItem indexItem) throws IOException {  
     // deleting the item, if already exists  
     writer.deleteDocuments(new Term(IndexItem.ID, indexItem.getId().toString()));  
     Document doc = new Document();  
     doc.add(new Field(IndexItem.ID, indexItem.getId().toString(), Field.Store.YES, Field.Index.NOT_ANALYZED));  
     doc.add(new Field(IndexItem.TITLE, indexItem.getTitle(), Field.Store.YES, Field.Index.ANALYZED));  
     doc.add(new Field(IndexItem.CONTENT, indexItem.getContent(), Field.Store.YES, Field.Index.ANALYZED));  
     // add the document to the index  
     writer.addDocument(doc);  
   }  
   /**  
    * Closing the index  
    */  
   public void close() throws IOException {  
     writer.close();  
   }  
 }  

Field.Store.YES: if you need to store the value, so that the value can be retrieved from the searched result.
Field.Index.ANALYZED: Index the tokens produced by running the field's value through an Analyzer.

Field.Index.NOT_ANALYZED: Index the field's value without using an Analyzer, so it can be searched.

Step 5: Defining the Searcher

 package org.fazlan.lucene.demo;  
 import org.apache.lucene.analysis.standard.StandardAnalyzer;  
 import org.apache.lucene.document.Document;  
 import org.apache.lucene.index.IndexReader;  
 import org.apache.lucene.queryParser.ParseException;  
 import org.apache.lucene.queryParser.QueryParser;  
 import org.apache.lucene.search.*;  
 import org.apache.lucene.store.FSDirectory;  
 import org.apache.lucene.util.Version;  
 import java.io.File;  
 import java.io.IOException;  
 import java.util.ArrayList;  
 import java.util.List;  

 public class Searcher {  

   private IndexSearcher searcher;  
   private QueryParser titleQueryParser;  
   private QueryParser contentQueryParser;  

   public Searcher(String indexDir) throws IOException {  
     // open the index directory to search  
     searcher = new IndexSearcher(IndexReader.open(FSDirectory.open(new File(indexDir))));  
     StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);  
     // defining the query parser to search items by title field.  
     titleQueryParser = new QueryParser(Version.LUCENE_36, IndexItem.TITLE, analyzer);  
     // defining the query parser to search items by content field.  
     contentQueryParser = new QueryParser(Version.LUCENE_36, IndexItem.CONTENT, analyzer);  
   }  

   /**  
    * This method is used to find the indexed items by the title.  
    * @param queryString - the query string to search for  
    */  
   public List<IndexItem> findByTitle(String queryString, int numOfResults) throws ParseException, IOException {  
     // create query from the incoming query string.  
     Query query = titleQueryParser.parse(queryString);  
     // execute the query and get the results  
     ScoreDoc[] queryResults = searcher.search(query, numOfResults).scoreDocs;  
     List<IndexItem> results = new ArrayList<IndexItem>();  
     // process the results  
     for (ScoreDoc scoreDoc : queryResults) {  
       Document doc = searcher.doc(scoreDoc.doc);  
       results.add(new IndexItem(Long.parseLong(doc.get(IndexItem.ID)), doc.get(IndexItem.TITLE), doc.get(IndexItem  
           .CONTENT)));  
     }  
      return results;  
   }  

   /**  
    * This method is used to find the indexed items by the content.  
    * @param queryString - the query string to search for  
    */  
   public List<IndexItem> findByContent(String queryString, int numOfResults) throws ParseException, IOException {  
     // create query from the incoming query string.  
     Query query = contentQueryParser.parse(queryString);  
     // execute the query and get the results  
     ScoreDoc[] queryResults = searcher.search(query, numOfResults).scoreDocs;  
     List<IndexItem> results = new ArrayList<IndexItem>();  
     // process the results  
     for (ScoreDoc scoreDoc : queryResults) {  
       Document doc = searcher.doc(scoreDoc.doc);  
       results.add(new IndexItem(Long.parseLong(doc.get(IndexItem.ID)), doc.get(IndexItem.TITLE), doc.get(IndexItem  
           .CONTENT)));  
     }  
      return results;  
   }  

   public void close() throws IOException {  
     searcher.close();  
   }  
 }  

Step 6: The Application using the Indexer and the Searcher

 package org.fazlan.lucene.demo; 
 
 import org.apache.lucene.queryParser.ParseException;  
 import java.io.BufferedReader;  
 import java.io.IOException;  
 import java.io.InputStreamReader;  
 import java.util.List;  

 public class Main {  
   // location where the index will be stored.  
   private static final String INDEX_DIR = "src/main/resources/index";  
   private static final int DEFAULT_RESULT_SIZE = 100;  

   public static void main(String[] args) throws IOException, ParseException {  
     // the items to be indexed  
     IndexItem[] indexItems = {  
         new IndexItem(1L, "Java in Action", "This is Java in Action Book"),  
         new IndexItem(2L, "Spring in Action", "This is Spring in Action Book"),  
         new IndexItem(3L, "Hibernate in Action", "This is Hibernate in Action Book"),  
         new IndexItem(4L, "SOA in Action", "This is SOA in Action Book"),  
         new IndexItem(5L, "Apache Axis2 in Action", "This is Axis2 in Action Book"),  
         new IndexItem(6L, "Apache CXF in Action", "This is CXF in Action Book"),  
         new IndexItem(7L, "jQuery in Action", "This is jQuery in Action Book")};  

     // creating the indexer and indexing the items  
     Indexer indexer = new Indexer(INDEX_DIR);  
     for (IndexItem indexItem : indexItems) {  
       indexer.index(indexItem);  
     }  

     // close the index to enable them index  
     indexer.close();  

     BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));  
     String input;  
     System.out.println("Type Q/q to quit.");  
     System.out.println("Type 1 query by title.");  
     System.out.println("Type 2 query by content.");  

     // creating the Searcher to the same index location as the Indexer  
     Searcher searcher = new Searcher(INDEX_DIR);  

     do {  
       System.out.print("Enter input: ");  
       input = reader.readLine();  
       if (input.equalsIgnoreCase("q")) {  
         break;  
       }  

       // search by title  
       if (input.equals("1")) {  
         System.out.print("Enter title to search: ");  
         input = reader.readLine();  
         List<IndexItem> result = searcher.findByTitle(input, DEFAULT_RESULT_SIZE);  
         print(result);  

       } else if (input.equals("2")) { // else, search by content  
         System.out.print("Enter content to search: ");  
         input = reader.readLine();  
         List<IndexItem> result = searcher.findByContent(input, DEFAULT_RESULT_SIZE);  
         print(result);  
       }  
     } while (true);  

     searcher.close();  
   }  

   /**  
    * print the results.  
    */  
   private static void print(List<IndexItem> result) {  
     System.out.println("Result Size: " + result.size());  
     for (IndexItem item : result) {  
       System.out.println(item);  
     }  
   }  
 }  

Summary:
This article looked at how you can easily introduce text based indexing into your application using Apache Lucene.

The sample code can be found here.

3 comments:

  1. Dear Fazlan,

    My name is Julian and I work for a publishers called Packt Publishing. We publish books for all levels of I.T. users across Enterprise and Open Source software (www.packtpub.com if you would like to have a look).

    We currently have a book in development called ‘Lucene 4 Cookbook’. This book will hopefully be written for, and targeted towards Java application developers who develop any applications that use data and want to integrate the Lucene search engine. They will have a strong understanding of Java, but we only expect a very basic knowledge of Lucene will be required. The book will use the ‘cookbook’ format, showing readers how to develop practical skills through learning the library and the features that it contains. It will give developers a tutorial guide for achieving this, and show them best practices and other inspiration on how to achieve tasks quickly and effectively.

    I was wondering if you would be interested in discussing the opportunity to author this title for us? I’ve seen from your involvement in the Lucene community and your blog that you are very knowledgeable about this area, and would be delighted if you did want to discuss this opportunity further.

    I look forward to hearing from you soon,

    Kind regards,

    Julian Ursell

    ITCCE
    [Packt Publishing]

    ReplyDelete
    Replies
    1. Hi Julian,

      Firstly, thank you very much for your interest and your request. I appreciate a lot.

      Yes, absolutely, it's a pleasure and that's fantastic! I look forward to discuss about the opportunity further.

      Cheers,
      Fazlan

      Delete