Thursday, June 7, 2012

Apache Lucene Tutorial: Indexing Microsoft Documents

Overview:
This article is a sequel to Apache Lucene Tutorial: Lucene for Text Search. Here, we look at how to index the content of Microsoft documents such as Word, Excel and PowerPoint files. Apache Lucene has no built-in capability to process these files, so we need a library that can extract text from MS document files.

One such library is Apache POI, which we'll use in this article. You can read more about Apache POI.

This article applies to Lucene 3.6.0 and POI 3.8.

Sample code can be found here.

You may also refer to Apache Lucene Tutorial: Indexing PDF Files

Project Structure:

 org.fazlan.lucene.ms.demo  
 |-- pom.xml  
 `-- src  
   `-- main  
     |-- java  
     |  `-- org  
     |    `-- fazlan  
     |      `-- lucene  
     |        `-- demo  
     |          |-- FileIndexApplication.java  
     |          |-- FileIndexer.java  
     |          |-- Indexer.java  
     |          |-- IndexItem.java  
     |          |-- MSDocumentIndexer.java  
     |          `-- Searcher.java  
     `-- resources  
       `-- files  
         |-- MSExcell.xls  
         |-- MSWord.doc  
         `-- MSWord.docx  

Step 1: Creating the Project

 mvn archetype:generate -DartifactId=org.fazlan.lucene.ms.demo -DgroupId=org.fazlan -Dversion=1.0-SNAPSHOT -DinteractiveMode=false  

Step 2: Updating the Maven Dependencies

 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  
      xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">  
   <modelVersion>4.0.0</modelVersion>  
   <groupId>org.fazlan</groupId>  
   <artifactId>org.fazlan.lucene.ms.demo</artifactId>  
   <packaging>jar</packaging>  
   <version>1.0-SNAPSHOT</version>  
   <name>org.fazlan.lucene.demo</name>  
   <url>http://maven.apache.org</url>  
   <dependencies>  
     <dependency>  
       <groupId>org.apache.lucene</groupId>  
       <artifactId>lucene-core</artifactId>  
       <version>3.6.0</version>  
     </dependency>  
     <dependency>  
       <groupId>org.apache.poi</groupId>  
       <artifactId>poi</artifactId>  
       <version>3.8</version>  
     </dependency>  
     <dependency>  
       <groupId>org.apache.poi</groupId>  
       <artifactId>poi-ooxml</artifactId>  
       <version>3.8</version>  
     </dependency>  
     <dependency>  
       <groupId>org.apache.poi</groupId>  
       <artifactId>poi-scratchpad</artifactId>  
       <version>3.8</version>  
     </dependency>  
     <dependency>  
       <groupId>junit</groupId>  
       <artifactId>junit</artifactId>  
       <version>3.8.1</version>  
       <scope>test</scope>  
     </dependency>  
   </dependencies>  
 </project>  

Step 3: Defining the MS Document Indexer


This is the most important component. The following code loads the content of an MS Word, MS Excel, MS PowerPoint or Visio file and extracts it into a String so that Lucene can process it further for indexing.

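Extractors already exist for Excel, Word, PowerPoint and Visio. The ExtractorFactory class picks the appropriate extractor for a given file by inspecting the file itself rather than trusting its extension. If one of these document types is embedded in another document, such as a worksheet, ExtractorFactory can also be used to recover an extractor for the embedded object.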

 package org.fazlan.lucene.demo;  

 import org.apache.poi.extractor.ExtractorFactory;  
 import java.io.File;  
 import java.io.IOException;  

 public class MSDocumentIndexer implements FileIndexer {  

   public IndexItem index(File file) throws IOException {  

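     String content = "";  
     try {  
       // ExtractorFactory selects the matching extractor (Word, Excel, PowerPoint or Visio)  
       content = ExtractorFactory.createExtractor(file).getText();  
     } catch (Exception e) {  
       // extraction failed: report the error and index the file with empty content  
       e.printStackTrace();  
     }  

     // the File's hash code (computed from its path) serves as the item id  
     return new IndexItem((long) file.hashCode(), file.getName(), content);  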
   }  
 }  

The above code uses org.apache.poi.extractor.ExtractorFactory, a class available in POI 3.5 or higher that provides a similar function to WorkbookFactory. You simply pass it an InputStream, a File, a POIFSFileSystem or an OOXML Package; it figures out the correct text extractor for you and returns it.
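To give a feel for how a factory can "figure out" the format without looking at the file name, here is a minimal sketch that checks the first bytes of a file against two well-known signatures: the OLE2 compound document header used by the binary .doc/.xls/.ppt formats, and the ZIP header used by OOXML packages such as .docx. The class OfficeFormatSniffer and its names are hypothetical and are not part of Apache POI; POI's own detection is more thorough.

```java
// Hypothetical helper for illustration only; this is NOT Apache POI code.
final class OfficeFormatSniffer {

    // ZIP only means "possibly OOXML": any ZIP archive starts the same way.
    enum Format { OLE2, ZIP, UNKNOWN }

    // Signature at the start of every OLE2 compound document (.doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature of a ZIP archive ("PK\003\004").
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // Classifies a file from its leading bytes (pass at least the first 8).
    static Format sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return Format.OLE2;
        if (startsWith(head, ZIP_MAGIC)) return Format.ZIP;
        return Format.UNKNOWN;
    }

    // True when data begins with all bytes of prefix.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

In practice you would read the first eight bytes of the file into an array and pass it to sniff. A renamed file (say, a .docx saved with a .doc extension) is still classified by what it actually contains.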

Using POI < 3.5 to Extract Text Content

Extracting Content from MS Word Document
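 //MS Word  
 import java.io.FileInputStream;  
 import org.apache.poi.hwpf.extractor.WordExtractor;  
 import org.apache.poi.poifs.filesystem.POIFSFileSystem;  

 POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("MS Word.doc"));   
 WordExtractor extractor = new WordExtractor(fs);   
 String content = extractor.getText();  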


Extracting Content from MS Excel Document
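 //MS Excel  
 import java.io.FileInputStream;  
 import org.apache.poi.hssf.extractor.ExcelExtractor;  
 import org.apache.poi.poifs.filesystem.POIFSFileSystem;  

 POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("MS Excel.xls"));   
 ExcelExtractor extractor = new ExcelExtractor(fs);   
 String content = extractor.getText();   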

Extracting Content from MS PowerPoint Document
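 //MS PowerPoint  
 import java.io.FileInputStream;  
 import org.apache.poi.hslf.extractor.PowerPointExtractor;  
 import org.apache.poi.poifs.filesystem.POIFSFileSystem;  

 POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("MS PowerPoint.ppt"));   
 PowerPointExtractor extractor = new PowerPointExtractor(fs);   
 String content = extractor.getText();   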
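
With the per-format extractors above, the calling code has to decide which extractor to construct for each file. A minimal sketch of that decision, keyed on the file extension, might look like the following. LegacyExtractorChooser and its Kind enum are hypothetical names used only for illustration; they are not part of POI or of the sample project.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of Apache POI or the sample project.
final class LegacyExtractorChooser {

    enum Kind { WORD, EXCEL, POWERPOINT, UNSUPPORTED }

    // Picks the extractor family from the file extension (case-insensitive).
    static Kind choose(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (ext.equals("doc")) return Kind.WORD;        // would construct a WordExtractor
        if (ext.equals("xls")) return Kind.EXCEL;       // would construct an ExcelExtractor
        if (ext.equals("ppt")) return Kind.POWERPOINT;  // would construct a PowerPointExtractor
        return Kind.UNSUPPORTED;                        // e.g. .docx, which these classes cannot read
    }
}
```

This kind of hand-written dispatch is exactly what ExtractorFactory removes from your code in POI 3.5 and later.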

Step 4: Writing an Application


The following sample application indexes a few MS document files and then searches the resulting index.

 package org.fazlan.lucene.demo;  

 import org.apache.lucene.queryParser.ParseException;  
 import java.io.File;  
 import java.io.IOException;  
 import java.util.List;  

 public class FileIndexApplication {  

   // location where the index will be stored.  
   private static final String INDEX_DIR = "src/main/resources/index";  
   private static final int DEFAULT_RESULT_SIZE = 100;  

   public static void main(String[] args) throws IOException, ParseException {  

     MSDocumentIndexer msDocumentIndexer = new MSDocumentIndexer();  
     // .doc is the binary Word 97-2003 format; .docx is the XML-based Word 2007+ format  
     File msWordFile = new File("src/main/resources/files/MSWord.doc");  
     File msWord2007File = new File("src/main/resources/files/MSWord.docx");  
     File msExcelFile = new File("src/main/resources/files/MSExcell.xls");  

     // creating the indexer and indexing the items  
     Indexer indexer = new Indexer(INDEX_DIR);  
     indexer.index(msDocumentIndexer.index(msWordFile));  
     indexer.index(msDocumentIndexer.index(msWord2007File));  
     indexer.index(msDocumentIndexer.index(msExcelFile));  

     // close the indexer so that the indexed items become searchable  
     indexer.close();  

     // creating the Searcher to the same index location as the Indexer  
     Searcher searcher = new Searcher(INDEX_DIR);  
     List<IndexItem> result = searcher.findByContent("Microsoft", DEFAULT_RESULT_SIZE);  
     print(result);  

     searcher.close();  
   }  

    /**  
    * print the results.  
    */  
   private static void print(List<IndexItem> result) {  
     System.out.println("Result Size: " + result.size());  
     for (IndexItem item : result) {  
       System.out.println(item);  
     }  
   }  
 }  
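
The application relies on IndexItem and FileIndexer, which are not listed in this article. As a rough guide, the sketch below shows a shape that is consistent with how they are used above: a three-argument constructor taking an id, a title and the content, and an index(File) method that returns an IndexItem. The getter names and the toString format are assumptions made for illustration, and the types are left package-private here only to keep the sketch in one file.

```java
import java.io.File;
import java.io.IOException;

// Assumed shape of the item handed to the Indexer; getter names are guesses.
final class IndexItem {
    private final Long id;
    private final String title;
    private final String content;

    IndexItem(Long id, String title, String content) {
        this.id = id;
        this.title = title;
        this.content = content;
    }

    Long getId() { return id; }
    String getTitle() { return title; }
    String getContent() { return content; }

    // Used when the application prints each search result.
    @Override
    public String toString() {
        return "IndexItem{id=" + id + ", title='" + title + "'}";
    }
}

// Assumed contract implemented by MSDocumentIndexer: turn a file into an IndexItem.
interface FileIndexer {
    IndexItem index(File file) throws IOException;
}
```

Keeping the file-to-IndexItem conversion behind an interface like this lets the same Indexer work with other file types by swapping in a different FileIndexer.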

Summary:
This was a brief article on how to integrate Apache POI with Apache Lucene to index the contents of MS documents such as Word, Excel and PowerPoint files.


2 comments:

  1. We are again one way forward in Text Extraction, Organize Text and Searching Text by Lucene, now for MS Documents. We will implement this idea in our applications
