Sunday, June 17, 2012

How to Enable UTF-8 Support on Tomcat

Overview:
This is a brief article on enabling support for displaying UTF-8 characters (e.g. Japanese text) on the JSP/HTML pages of a web application running on Tomcat. We can achieve this in three easy steps.

To make UTF-8 work under Java and Tomcat (on Linux or Windows), the following is required:
  1. Update Tomcat's server.xml
  2. Define a javax.servlet.Filter and Update the Web Application's web.xml
  3. Enable UTF-8 encoding on JSP/HTML
Update Tomcat's server.xml
This handles GET request URLs. With this configuration, the Connector decodes all incoming GET request parameters as UTF-8.

 <Connector   
         . . .  
         URIEncoding="UTF-8"/>  

 http://localhost:8080/foo-app/get?foo_param=こんにちは世界  

e.g. request.getParameter("foo_param") // the value is decoded as UTF-8, so you retrieve "こんにちは世界" as-is.

IMPORTANT NOTE: this change has NO effect on POST requests.
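To see why URIEncoding matters, note that the same percent-encoded bytes decode very differently under UTF-8 and under ISO-8859-1 (the historical default). The following is a stand-alone sketch using only java.net.URLDecoder, not Tomcat itself:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class UriEncodingDemo {

    // Percent-encoded UTF-8 bytes for こんにちは ("hello"), as they would
    // appear in the query string of the GET request above.
    static final String ENCODED =
            "%E3%81%93%E3%82%93%E3%81%AB%E3%81%A1%E3%81%AF";

    // Decoded the way a Connector with URIEncoding="UTF-8" would decode it.
    static String decodedUtf8() {
        return URLDecoder.decode(ENCODED, StandardCharsets.UTF_8);
    }

    // Decoded the way an unconfigured (ISO-8859-1) Connector would:
    // one mojibake character per byte, 15 characters instead of 5.
    static String decodedLatin1() {
        return URLDecoder.decode(ENCODED, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(decodedUtf8());   // こんにちは
        System.out.println(decodedLatin1()); // mojibake
    }
}
```

Running this shows the correctly decoded Japanese text first, then the 15-character garbage you would see in request.getParameter() without the URIEncoding setting.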

Define a javax.servlet.Filter and Update the Web Application's web.xml
Now, we need to force our web application to handle all requests and responses as UTF-8, which covers POST requests as well. For this purpose, we define a character set filter that applies UTF-8 encoding to every request and response, in the following manner.

 package org.fazlan.tomcat.ext.filter;
  
 import javax.servlet.Filter;  
 import javax.servlet.FilterChain;  
 import javax.servlet.FilterConfig;  
 import javax.servlet.ServletException;  
 import javax.servlet.ServletRequest;  
 import javax.servlet.ServletResponse;  
 import java.io.IOException;  

 /***  
  * This is a filter class to force the java webapp to handle all requests and responses as UTF-8 encoded by default.  
  * This requires that we define a character set filter.  
  * This filter makes sure that if the browser hasn't set the encoding used in the request, that it's set to UTF-8.  
  */  
 public class CharacterSetFilter implements Filter {  

   private static final String UTF8 = "UTF-8";  
   private static final String CONTENT_TYPE = "text/html; charset=UTF-8";  
   private String encoding;  

   @Override  
   public void init(FilterConfig config) throws ServletException {  
     encoding = config.getInitParameter("requestCharEncoding");  
     if (encoding == null) {  
       encoding = UTF8;  
     }  
   }  

   @Override  
   public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {  
     // Honour the client-specified character encoding  
     if (null == request.getCharacterEncoding()) {  
       request.setCharacterEncoding(encoding);  
     }  
     /**  
      * Set the default response content type and encoding  
      */  
     response.setContentType(CONTENT_TYPE);  
     response.setCharacterEncoding(UTF8);  
     chain.doFilter(request, response);  
   }  

   @Override  
   public void destroy() {  
   }  
 }  

The filter ensures that if the browser has not set the encoding format in the request, UTF-8 is set as the default encoding. Also, it sets UTF-8 as the default response encoding.
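The failure mode the filter guards against can be reproduced with plain java.lang.String, no servlet API involved: decoding UTF-8 bytes with the wrong charset yields mojibake. A minimal stand-alone sketch:

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Decode UTF-8 bytes with the wrong charset, as a container does when
    // no request character encoding has been set.
    static String garble(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // ISO-8859-1 preserves every byte, so the original text can still be
    // recovered by re-encoding and decoding -- but the filter above makes
    // this workaround unnecessary in the first place.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1),
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "こんにちは世界";   // e.g. a UTF-8 POST body
        String garbled = garble(original);    // 21 mojibake chars, one per byte
        System.out.println(garbled);
        System.out.println(repair(garbled));  // こんにちは世界
    }
}
```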

Now, we need to add this to our web application's web.xml to make it work.
 . . .
 <filter>  
   <filter-name>CharacterSetFilter</filter-name>  
   <filter-class>org.fazlan.tomcat.ext.filter.CharacterSetFilter</filter-class>  
   <init-param>  
   <param-name>requestCharEncoding</param-name>  
     <param-value>UTF-8</param-value>  
   </init-param>  
 </filter>  
 <filter-mapping>  
   <filter-name>CharacterSetFilter</filter-name>  
   <url-pattern>/*</url-pattern>  
 </filter-mapping>  
 . . .

Enable UTF-8 encoding on JSP/HTML
JSP Pages
All JSP pages that need to render UTF-8 content must have the following page declaration at the top of the page.

 <%@ page contentType="text/html;charset=UTF-8" language="java" pageEncoding="UTF-8" %>  

HTML Pages 
All HTML pages that need to render UTF-8 content must have the following in their head section.

 <?xml version="1.0" encoding="UTF-8"?>  
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">  
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">  
 <head>  
 <meta http-equiv="content-type" content="application/xhtml+xml; charset=UTF-8" />  
 ...  
 </head>  

Summary: 
This article looked at how to support UTF-8 content in a web application deployed on Tomcat.

Thursday, June 7, 2012

Apache Lucene Tutorial: Indexing PDF Files

Overview:
This article is a sequel to Apache Lucene Tutorial: Lucene for Text Search. Here, we look at how to index the content of a PDF file. Apache Lucene doesn't have the built-in capability to process PDF files. Therefore, we need to use an API that enables us to extract text from PDF files.

One such library is Apache PDFBox, which we'll use in this article. You can read more about Apache PDFBox.

Article applies to Lucene 3.6.0 and PDFBox 0.7.3.

Sample code can be found here.

You may also refer to Apache Lucene Tutorial: Indexing Microsoft Documents

Project Structure:

 org.fazlan.lucene.pdf.demo  
 |-- pom.xml  
 `-- src  
   `-- main  
     |-- java  
     |  `-- org  
     |    `-- fazlan  
     |      `-- lucene  
     |        `-- demo  
     |          |-- FileIndexApplication.java  
     |          |-- FileIndexer.java  
     |          |-- Indexer.java  
     |          |-- IndexItem.java  
     |          |-- PDFIndexer.java  
     |          `-- Searcher.java  
     `-- resources  
       `-- files  
         `-- HelloPDFBox.pdf  

Step 1: Creating the Project

 mvn archetype:generate -DartifactId=org.fazlan.lucene.pdf.demo -DgroupId=org.fazlan -Dversion=1.0-SNAPSHOT -DinteractiveMode=false  

Step 2: Updating the Maven Dependencies

 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">  
  <modelVersion>4.0.0</modelVersion>  
  <groupId>org.fazlan</groupId>  
  <artifactId>org.fazlan.lucene.pdf.demo</artifactId>  
  <packaging>jar</packaging>  
  <version>1.0-SNAPSHOT</version>  
  <name>org.fazlan.lucene.demo</name>  
  <url>http://maven.apache.org</url>  
  <dependencies>  
   <dependency>  
      <groupId>org.apache.lucene</groupId>  
      <artifactId>lucene-core</artifactId>  
      <version>3.6.0</version>  
   </dependency>  
    <dependency>  
      <groupId>pdfbox</groupId>  
      <artifactId>pdfbox</artifactId>  
      <version>0.7.3</version>  
    </dependency>  
   <dependency>  
    <groupId>junit</groupId>  
    <artifactId>junit</artifactId>  
    <version>3.8.1</version>  
    <scope>test</scope>  
   </dependency>  
  </dependencies>  
 </project>  

Step 3: Defining the PDF Indexer


This is the most important component. The following code loads the content of a PDF file; the extracted content is turned into a String so that Lucene can process it further for indexing.

 package org.fazlan.lucene.demo;  
 import org.pdfbox.pdmodel.PDDocument;  
 import org.pdfbox.util.PDFTextStripper;  
 import java.io.File;  
 import java.io.IOException;  
 public class PDFIndexer implements FileIndexer {  
   public IndexItem index(File file) throws IOException {  
     PDDocument doc = PDDocument.load(file);  
     String content = new PDFTextStripper().getText(doc);  
     doc.close();  
     return new IndexItem((long)file.getName().hashCode(), file.getName(), content);  
   }  
 }  
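PDFIndexer implements FileIndexer, which appears in the project tree but is never listed in the article. It is presumably a one-method contract; the sketch below is a hypothetical reconstruction, with a minimal stand-in for the IndexItem POJO (defined in the companion text-search article) so that it compiles on its own:

```java
import java.io.File;
import java.io.IOException;

// Presumed contract (hypothetical reconstruction): one method that turns
// a file into an indexable item.
interface FileIndexer {
    IndexItem index(File file) throws IOException;
}

// Minimal stand-in for the IndexItem POJO from the text-search article,
// included only so this sketch compiles on its own.
class IndexItem {
    private final Long id;
    private final String title;
    private final String content;

    IndexItem(Long id, String title, String content) {
        this.id = id;
        this.title = title;
        this.content = content;
    }

    String getTitle() { return title; }
}

public class FileIndexerSketch {

    // A trivial implementation that "indexes" only the file name, showing
    // how PDFIndexer (and MSDocumentIndexer) plug into the contract.
    static String titleOf(String fileName) {
        FileIndexer nameOnly = file ->
                new IndexItem((long) file.getName().hashCode(), file.getName(), "");
        try {
            return nameOnly.index(new File(fileName)).getTitle();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleOf("HelloPDFBox.pdf"));
    }
}
```

Keeping the contract this narrow is what lets the same FileIndexApplication work with any of the indexers in this series.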

Step 4: Writing the Application


The following is a sample application code to index a PDF file.

 package org.fazlan.lucene.demo;  

 import org.apache.lucene.queryParser.ParseException;  
 import java.io.File;  
 import java.io.IOException;  
 import java.util.List;  

 public class FileIndexApplication {  

   // location where the index will be stored.  
   private static final String INDEX_DIR = "src/main/resources/index";  
   private static final int DEFAULT_RESULT_SIZE = 100;  

   public static void main(String[] args) throws IOException, ParseException {  

     File pdfFile = new File("src/main/resources/files/HelloPDFBox.pdf");  
     IndexItem pdfIndexItem = new PDFIndexer().index(pdfFile);  
     // creating the indexer and indexing the items  

     Indexer indexer = new Indexer(INDEX_DIR);  
     indexer.index(pdfIndexItem);  

     // close the index so the indexed documents become searchable  
     indexer.close();  

     // creating the Searcher to the same index location as the Indexer  
     Searcher searcher = new Searcher(INDEX_DIR);  
     List<IndexItem> result = searcher.findByContent("World", DEFAULT_RESULT_SIZE);  
     print(result);  

     searcher.close();  
   }  
    /**  
    * print the results.  
    */  
   private static void print(List<IndexItem> result) {  
     System.out.println("Result Size: " + result.size());  
     for (IndexItem item : result) {  
       System.out.println(item);  
     }  
   }  
 }  

Summary:
This was a brief article on how to integrate Apache PDFBox with Apache Lucene for indexing the contents in a PDF file.

Sample code can be found here.

Apache Lucene Tutorial: Indexing Microsoft Documents

Overview:
This article is a sequel to Apache Lucene Tutorial: Lucene for Text Search. Here, we look at how to index content in Microsoft documents such as Word, Excel and PowerPoint files. Apache Lucene doesn't have the built-in capability to process these files. Therefore, we need to use an API that enables us to extract text from MS documents.

One such library is Apache POI, which we'll use in this article. You can read more about Apache POI.

Article applies to Lucene 3.6.0 and POI 3.8.0.

Sample code can be found here.

You may also refer to Apache Lucene Tutorial: Indexing PDF Files

Project Structure:

 org.fazlan.lucene.ms.demo  
 |-- pom.xml  
 `-- src  
   `-- main  
     |-- java  
     |  `-- org  
     |    `-- fazlan  
     |      `-- lucene  
     |        `-- demo  
     |          |-- FileIndexApplication.java  
     |          |-- FileIndexer.java  
     |          |-- Indexer.java  
     |          |-- IndexItem.java  
     |          |-- MSDocumentIndexer.java  
     |          `-- Searcher.java  
     `-- resources  
       `-- files  
         |-- MSExcell.xls  
         |-- MSWord.doc  
         `-- MSWord.docx  

Step 1: Creating the Project

 mvn archetype:generate -DartifactId=org.fazlan.lucene.ms.demo -DgroupId=org.fazlan -Dversion=1.0-SNAPSHOT -DinteractiveMode=false  

Step 2: Updating the Maven Dependencies

 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  
      xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">  
   <modelVersion>4.0.0</modelVersion>  
   <groupId>org.fazlan</groupId>  
   <artifactId>org.fazlan.lucene.ms.demo</artifactId>  
   <packaging>jar</packaging>  
   <version>1.0-SNAPSHOT</version>  
   <name>org.fazlan.lucene.demo</name>  
   <url>http://maven.apache.org</url>  
   <dependencies>  
     <dependency>  
       <groupId>org.apache.lucene</groupId>  
       <artifactId>lucene-core</artifactId>  
       <version>3.6.0</version>  
     </dependency>  
     <dependency>  
       <groupId>org.apache.poi</groupId>  
       <artifactId>poi</artifactId>  
       <version>3.8</version>  
     </dependency>  
     <dependency>  
       <groupId>org.apache.poi</groupId>  
       <artifactId>poi-ooxml</artifactId>  
       <version>3.8</version>  
     </dependency>  
     <dependency>  
       <groupId>org.apache.poi</groupId>  
       <artifactId>poi-scratchpad</artifactId>  
       <version>3.8</version>  
     </dependency>  
     <dependency>  
       <groupId>junit</groupId>  
       <artifactId>junit</artifactId>  
       <version>3.8.1</version>  
       <scope>test</scope>  
     </dependency>  
   </dependencies>  
 </project>  

Step 3: Defining the MS Document Indexer


This is the most important component. The following code loads the content of an MS Word, MS Excel, MS PowerPoint or Visio file; the extracted content is turned into a String so that Lucene can process it further for indexing.

Extractors already exist for Excel, Word, PowerPoint and Visio files; the ExtractorFactory class can be used to obtain the appropriate extractor for a given file, even when one of these documents is embedded inside another (such as a worksheet).

 package org.fazlan.lucene.demo;  

 import org.apache.poi.extractor.ExtractorFactory;  
 import java.io.File;  
 import java.io.IOException;  

 public class MSDocumentIndexer implements FileIndexer {  

   public IndexItem index(File file) throws IOException {  

     String content = "";  
     try {  
       content = ExtractorFactory.createExtractor(file).getText();  
     } catch (Exception e) {  
       e.printStackTrace();  
     }  

     return new IndexItem((long) file.hashCode(), file.getName(), content);  
   }  
 }  

The above code uses a class introduced in POI 3.5: org.apache.poi.extractor.ExtractorFactory, which provides a function similar to WorkbookFactory. You simply pass it an InputStream, a File, a POIFSFileSystem or an OOXML package, and it figures out the correct text extractor and returns it.
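The factory's choice can be pictured as dispatch on file type. The following stand-alone sketch is purely illustrative: it returns extractor class names as strings rather than calling POI, since the real ExtractorFactory inspects the file contents (OLE2 vs. OOXML container) rather than just the extension:

```java
public class ExtractorDispatchSketch {

    // Map a file name to the extractor POI would typically choose.
    // Illustration only: the real ExtractorFactory sniffs the container
    // format, not the extension.
    static String extractorFor(String fileName) {
        String ext = fileName.substring(fileName.lastIndexOf('.') + 1).toLowerCase();
        switch (ext) {
            case "doc":  return "WordExtractor";
            case "xls":  return "ExcelExtractor";
            case "ppt":  return "PowerPointExtractor";
            case "docx": return "XWPFWordExtractor";
            case "xlsx": return "XSSFExcelExtractor";
            default:     return "unsupported: " + ext;
        }
    }

    public static void main(String[] args) {
        System.out.println(extractorFor("MSWord.doc"));   // WordExtractor
        System.out.println(extractorFor("MSExcell.xls")); // ExcelExtractor
    }
}
```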

Extracting Text Content with POI < 3.5

Extracting Content from MS Word Document
 //MS Word  
 POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("MS Word.doc"));   
 WordExtractor extractor = new WordExtractor(fs);   
 String content = extractor.getText();  


Extracting Content from MS Excel Document
 //MS Excel  
 POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("MS Excel.xls"));   
 ExcelExtractor extractor = new ExcelExtractor(fs);   
 String content = extractor.getText();   

Extracting Content from MS PowerPoint Document
 //MS PowerPoint  
 POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("MS PowerPoint.ppt"));   
 PowerPointExtractor extractor = new PowerPointExtractor(fs);   
 String content = extractor.getText();   

Step 4: Writing the Application


The following is a sample application code to index a MS document file.

 package org.fazlan.lucene.demo;  

 import org.apache.lucene.queryParser.ParseException;  
 import java.io.File;  
 import java.io.IOException;  
 import java.util.List;  

 public class FileIndexApplication {  

   // location where the index will be stored.  
   private static final String INDEX_DIR = "src/main/resources/index";  
   private static final int DEFAULT_RESULT_SIZE = 100;  

   public static void main(String[] args) throws IOException, ParseException {  

     MSDocumentIndexer msDocumentIndexer = new MSDocumentIndexer();  
     File msWordFile = new File("src/main/resources/files/MSWord.doc");  
     File msWord2007File = new File("src/main/resources/files/MSWord.docx"); // .docx is the Word 2007+ format  
     File msExcellFile = new File("src/main/resources/files/MSExcell.xls");  

     // creating the indexer and indexing the items  
     Indexer indexer = new Indexer(INDEX_DIR);  
     indexer.index(msDocumentIndexer.index(msWordFile));  
     indexer.index(msDocumentIndexer.index(msWord2007File));  
     indexer.index(msDocumentIndexer.index(msExcellFile));  

     // close the index so the indexed documents become searchable  
     indexer.close();  

     // creating the Searcher to the same index location as the Indexer  
     Searcher searcher = new Searcher(INDEX_DIR);  
     List<IndexItem> result = searcher.findByContent("Microsoft", DEFAULT_RESULT_SIZE);  
     print(result);  

     searcher.close();  
   }  

    /**  
    * print the results.  
    */  
   private static void print(List<IndexItem> result) {  
     System.out.println("Result Size: " + result.size());  
     for (IndexItem item : result) {  
       System.out.println(item);  
     }  
   }  
 }  

Summary:
This was a brief article on how to integrate Apache POI with Apache Lucene for indexing the contents of MS documents such as Word, Excel and PowerPoint files.

Sample code can be found here.

Monday, June 4, 2012

Apache Lucene Tutorial: Lucene for Text Search

Overview:
This article looks at how Apache Lucene can be used to perform text-based searching. Lucene provides high-performance text search capabilities and is a very easy library to use.

This article applies to Lucene 3.6 (latest release at the time of writing).

The sample code can be found here.

You may also refer to,
Apache Lucene Tutorial: Indexing PDF Files
Apache Lucene Tutorial: Indexing Microsoft Documents

Project Structure:

 org.fazlan.lucene.demo  
 |-- pom.xml  
 `-- src  
   `-- main  
     |-- java  
     |  `-- org  
     |    `-- fazlan  
     |      `-- lucene  
     |        `-- demo  
     |          |-- Indexer.java  
     |          |-- IndexItem.java  
     |          |-- Main.java  
     |          `-- Searcher.java  
     `-- resources  
       `-- index  

Step 1: Creating a Maven Project

 mvn archetype:generate -DartifactId=org.fazlan.lucene.demo -DgroupId=org.fazlan -Dversion=1.0-SNAPSHOT -DinteractiveMode=false  

Step 2: Updating the Maven Dependencies (pom.xml)

 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">  
  <modelVersion>4.0.0</modelVersion>  
  <groupId>org.fazlan</groupId>  
  <artifactId>org.fazlan.lucene.demo</artifactId>  
  <packaging>jar</packaging>  
  <version>1.0-SNAPSHOT</version>  
  <name>org.fazlan.lucene.demo</name>  
  <url>http://maven.apache.org</url>  
  <dependencies>  
   <dependency>  
      <groupId>org.apache.lucene</groupId>  
      <artifactId>lucene-core</artifactId>  
      <version>3.6.0</version>  
   </dependency>  
   <dependency>  
    <groupId>junit</groupId>  
    <artifactId>junit</artifactId>  
    <version>3.8.1</version>  
    <scope>test</scope>  
   </dependency>  
  </dependencies>  
 </project>  

Step 3: Defining the POJO Class used to Index Items


 package org.fazlan.lucene.demo;  
 public class IndexItem {  
   private Long id;  
   private String title;  
   private String content;  
   public static final String ID = "id";  
   public static final String TITLE = "title";  
   public static final String CONTENT = "content";  
   public IndexItem(Long id, String title, String content) {  
     this.id = id;  
     this.title = title;  
     this.content = content;  
   }  
   public Long getId() {  
     return id;  
   }  
   public String getTitle() {  
     return title;  
   }  
   public String getContent() {  
     return content;  
   }  
   @Override  
   public String toString() {  
     return "IndexItem{" +  
         "id=" + id +  
         ", title='" + title + '\'' +  
         ", content='" + content + '\'' +  
         '}';  
   }  
 }  

Step 4: Defining the Indexer

 package org.fazlan.lucene.demo;  
 import org.apache.lucene.analysis.standard.StandardAnalyzer;  
 import org.apache.lucene.document.Document;  
 import org.apache.lucene.document.Field;  
 import org.apache.lucene.index.IndexWriter;  
 import org.apache.lucene.index.IndexWriterConfig;  
 import org.apache.lucene.index.Term;  
 import org.apache.lucene.store.FSDirectory;  
 import org.apache.lucene.util.Version;  
 import java.io.File;  
 import java.io.IOException;  
 public class Indexer  
 {  
   private IndexWriter writer;  
   public Indexer(String indexDir) throws IOException {  
     // create the index  
     if(writer == null) {  
         writer = new IndexWriter(FSDirectory.open(  
           new File(indexDir)), new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));  
     }  
   }  
   /**   
    * This method will add the items into index  
    */  
   public void index(IndexItem indexItem) throws IOException {  
     // deleting the item, if already exists  
     writer.deleteDocuments(new Term(IndexItem.ID, indexItem.getId().toString()));  
     Document doc = new Document();  
     doc.add(new Field(IndexItem.ID, indexItem.getId().toString(), Field.Store.YES, Field.Index.NOT_ANALYZED));  
     doc.add(new Field(IndexItem.TITLE, indexItem.getTitle(), Field.Store.YES, Field.Index.ANALYZED));  
     doc.add(new Field(IndexItem.CONTENT, indexItem.getContent(), Field.Store.YES, Field.Index.ANALYZED));  
     // add the document to the index  
     writer.addDocument(doc);  
   }  
   /**  
    * Closing the index  
    */  
   public void close() throws IOException {  
     writer.close();  
   }  
 }  

Field.Store.YES: store the field's value in the index so that it can be retrieved from search results.

Field.Index.ANALYZED: index the tokens produced by running the field's value through an Analyzer.

Field.Index.NOT_ANALYZED: index the field's value as a single token, without an Analyzer, so it can be searched as an exact value.

Step 5: Defining the Searcher

 package org.fazlan.lucene.demo;  
 import org.apache.lucene.analysis.standard.StandardAnalyzer;  
 import org.apache.lucene.document.Document;  
 import org.apache.lucene.index.IndexReader;  
 import org.apache.lucene.queryParser.ParseException;  
 import org.apache.lucene.queryParser.QueryParser;  
 import org.apache.lucene.search.*;  
 import org.apache.lucene.store.FSDirectory;  
 import org.apache.lucene.util.Version;  
 import java.io.File;  
 import java.io.IOException;  
 import java.util.ArrayList;  
 import java.util.List;  

 public class Searcher {  

   private IndexSearcher searcher;  
   private QueryParser titleQueryParser;  
   private QueryParser contentQueryParser;  

   public Searcher(String indexDir) throws IOException {  
     // open the index directory to search  
     searcher = new IndexSearcher(IndexReader.open(FSDirectory.open(new File(indexDir))));  
     StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);  
     // defining the query parser to search items by title field.  
     titleQueryParser = new QueryParser(Version.LUCENE_36, IndexItem.TITLE, analyzer);  
     // defining the query parser to search items by content field.  
     contentQueryParser = new QueryParser(Version.LUCENE_36, IndexItem.CONTENT, analyzer);  
   }  

   /**  
    * This method is used to find the indexed items by the title.  
    * @param queryString - the query string to search for  
    */  
   public List<IndexItem> findByTitle(String queryString, int numOfResults) throws ParseException, IOException {  
     // create query from the incoming query string.  
     Query query = titleQueryParser.parse(queryString);  
     // execute the query and get the results  
     ScoreDoc[] queryResults = searcher.search(query, numOfResults).scoreDocs;  
     List<IndexItem> results = new ArrayList<IndexItem>();  
     // process the results  
     for (ScoreDoc scoreDoc : queryResults) {  
       Document doc = searcher.doc(scoreDoc.doc);  
       results.add(new IndexItem(Long.parseLong(doc.get(IndexItem.ID)), doc.get(IndexItem.TITLE), doc.get(IndexItem  
           .CONTENT)));  
     }  
      return results;  
   }  

   /**  
    * This method is used to find the indexed items by the content.  
    * @param queryString - the query string to search for  
    */  
   public List<IndexItem> findByContent(String queryString, int numOfResults) throws ParseException, IOException {  
     // create query from the incoming query string.  
     Query query = contentQueryParser.parse(queryString);  
     // execute the query and get the results  
     ScoreDoc[] queryResults = searcher.search(query, numOfResults).scoreDocs;  
     List<IndexItem> results = new ArrayList<IndexItem>();  
     // process the results  
     for (ScoreDoc scoreDoc : queryResults) {  
       Document doc = searcher.doc(scoreDoc.doc);  
       results.add(new IndexItem(Long.parseLong(doc.get(IndexItem.ID)), doc.get(IndexItem.TITLE), doc.get(IndexItem  
           .CONTENT)));  
     }  
      return results;  
   }  

   public void close() throws IOException {  
     searcher.close();  
   }  
 }  

Step 6: The Application using the Indexer and the Searcher

 package org.fazlan.lucene.demo; 
 
 import org.apache.lucene.queryParser.ParseException;  
 import java.io.BufferedReader;  
 import java.io.IOException;  
 import java.io.InputStreamReader;  
 import java.util.List;  

 public class Main {  
   // location where the index will be stored.  
   private static final String INDEX_DIR = "src/main/resources/index";  
   private static final int DEFAULT_RESULT_SIZE = 100;  

   public static void main(String[] args) throws IOException, ParseException {  
     // the items to be indexed  
     IndexItem[] indexItems = {  
         new IndexItem(1L, "Java in Action", "This is Java in Action Book"),  
         new IndexItem(2L, "Spring in Action", "This is Spring in Action Book"),  
         new IndexItem(3L, "Hibernate in Action", "This is Hibernate in Action Book"),  
         new IndexItem(4L, "SOA in Action", "This is SOA in Action Book"),  
         new IndexItem(5L, "Apache Axis2 in Action", "This is Axis2 in Action Book"),  
         new IndexItem(6L, "Apache CXF in Action", "This is CXF in Action Book"),  
         new IndexItem(7L, "jQuery in Action", "This is jQuery in Action Book")};  

     // creating the indexer and indexing the items  
     Indexer indexer = new Indexer(INDEX_DIR);  
     for (IndexItem indexItem : indexItems) {  
       indexer.index(indexItem);  
     }  

     // close the index so the indexed documents become searchable  
     indexer.close();  

     BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));  
     String input;  
     System.out.println("Type Q/q to quit.");  
     System.out.println("Type 1 query by title.");  
     System.out.println("Type 2 query by content.");  

     // creating the Searcher to the same index location as the Indexer  
     Searcher searcher = new Searcher(INDEX_DIR);  

     do {  
       System.out.print("Enter input: ");  
       input = reader.readLine();  
       if (input.equalsIgnoreCase("q")) {  
         break;  
       }  

       // search by title  
       if (input.equals("1")) {  
         System.out.print("Enter title to search: ");  
         input = reader.readLine();  
         List<IndexItem> result = searcher.findByTitle(input, DEFAULT_RESULT_SIZE);  
         print(result);  

       } else if (input.equals("2")) { // else, search by content  
         System.out.print("Enter content to search: ");  
         input = reader.readLine();  
         List<IndexItem> result = searcher.findByContent(input, DEFAULT_RESULT_SIZE);  
         print(result);  
       }  
     } while (true);  

     searcher.close();  
   }  

   /**  
    * print the results.  
    */  
   private static void print(List<IndexItem> result) {  
     System.out.println("Result Size: " + result.size());  
     for (IndexItem item : result) {  
       System.out.println(item);  
     }  
   }  
 }  

Summary:
This article looked at how you can easily introduce text based indexing into your application using Apache Lucene.

The sample code can be found here.