Thursday, June 7, 2012

Apache Lucene Tutorial: Indexing PDF Files

Overview:
This article is a sequel to Apache Lucene Tutorial: Lucene for Text Search. Here, we look at how to index the content of a PDF file. Apache Lucene doesn't have a built-in capability to process PDF files, so we need a separate library that can extract text from them.

One such library is Apache PDFBox, which we'll use in this article. You can read more about Apache PDFBox.

This article applies to Lucene 3.6.0 and PDFBox 0.7.3.

Sample code can be found here.

You may also refer to Apache Lucene Tutorial: Indexing Microsoft Documents

Project Structure:

 org.fazlan.lucene.pdf.demo  
 |-- pom.xml  
 `-- src  
   `-- main  
     |-- java  
     |  `-- org  
     |    `-- fazlan  
     |      `-- lucene  
     |        `-- demo  
     |          |-- FileIndexApplication.java  
     |          |-- FileIndexer.java  
     |          |-- Indexer.java  
     |          |-- IndexItem.java  
     |          |-- PDFIndexer.java  
     |          `-- Searcher.java  
     `-- resources  
       `-- files  
         `-- HelloPDFBox.pdf  

Step 1: Creating the Project

 mvn archetype:generate -DartifactId=org.fazlan.lucene.pdf.demo -DgroupId=org.fazlan -Dversion=1.0-SNAPSHOT -DinteractiveMode=false  

Step 2: Updating the Maven Dependencies

 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">  
  <modelVersion>4.0.0</modelVersion>  
  <groupId>org.fazlan</groupId>  
  <artifactId>org.fazlan.lucene.pdf.demo</artifactId>  
  <packaging>jar</packaging>  
  <version>1.0-SNAPSHOT</version>  
  <name>org.fazlan.lucene.demo</name>  
  <url>http://maven.apache.org</url>  
  <dependencies>  
   <dependency>  
      <groupId>org.apache.lucene</groupId>  
      <artifactId>lucene-core</artifactId>  
      <version>3.6.0</version>  
   </dependency>  
    <dependency>  
      <groupId>pdfbox</groupId>  
      <artifactId>pdfbox</artifactId>  
      <version>0.7.3</version>  
    </dependency>  
   <dependency>  
    <groupId>junit</groupId>  
    <artifactId>junit</artifactId>  
    <version>3.8.1</version>  
    <scope>test</scope>  
   </dependency>  
  </dependencies>  
 </project>  

Step 3: Defining the PDF Indexer


This is the most important component. The following code loads a PDF file and extracts its content into a String, which Lucene can then process for indexing.

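 package org.fazlan.lucene.demo;

 import org.pdfbox.pdmodel.PDDocument;
 import org.pdfbox.util.PDFTextStripper;

 import java.io.File;
 import java.io.IOException;

 public class PDFIndexer implements FileIndexer {

   public IndexItem index(File file) throws IOException {
     // Load the PDF; the document holds resources that must be released.
     PDDocument doc = PDDocument.load(file);
     try {
       // Extract the text of the whole document as a single String.
       String content = new PDFTextStripper().getText(doc);
       // The hash code of the file name serves as a simple (not collision-free) ID.
       return new IndexItem((long) file.getName().hashCode(), file.getName(), content);
     } finally {
       // Close the document even if text extraction fails.
       doc.close();
     }
   }
 }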
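
PDFIndexer depends on two types that are not listed in this article: the FileIndexer interface and the IndexItem value class (both are part of the sample code). As a rough guide, here is a minimal sketch of what they might look like, inferred only from how they are used in this article. The getter names, the toString() format, and the values of the ID, TITLE and CONTENT constants are assumptions, and the types are left package-private here only so that the sketch compiles as a single file.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical sketch: in the actual project each type would live in its own
// file under org.fazlan.lucene.demo and would normally be declared public.

/** Turns a file into an IndexItem that can be handed to the Indexer. */
interface FileIndexer {
    IndexItem index(File file) throws IOException;
}

/** Simple value object holding the data that is stored in the index. */
class IndexItem {
    // Index field names. Only the constant names appear in the article;
    // the string values are assumptions.
    public static final String ID = "id";
    public static final String TITLE = "title";
    public static final String CONTENT = "content";

    private final Long id;
    private final String title;
    private final String content;

    IndexItem(Long id, String title, String content) {
        this.id = id;
        this.title = title;
        this.content = content;
    }

    public Long getId() {
        return id;
    }

    public String getTitle() {
        return title;
    }

    public String getContent() {
        return content;
    }

    @Override
    public String toString() {
        return "IndexItem{id=" + id + ", title='" + title + "', content='" + content + "'}";
    }
}
```

Because PDFIndexer derives the ID from the hash code of the file name, two files with the same name (or a rare hash collision) would end up with the same ID. That is fine for a demo, but a real application would probably want a more robust key.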

Step 4: Writing an Application


The following sample application indexes a PDF file and then searches the resulting index.

 package org.fazlan.lucene.demo;  

 import org.apache.lucene.queryParser.ParseException;  
 import java.io.File;  
 import java.io.IOException;  
 import java.util.List;  

 public class FileIndexApplication {  

   // location where the index will be stored.  
   private static final String INDEX_DIR = "src/main/resources/index";  
   private static final int DEFAULT_RESULT_SIZE = 100;  

   public static void main(String[] args) throws IOException, ParseException {  

     File pdfFile = new File("src/main/resources/files/HelloPDFBox.pdf");  
     IndexItem pdfIndexItem = new PDFIndexer().index(pdfFile);  
     // creating the indexer and indexing the items  

     Indexer indexer = new Indexer(INDEX_DIR);  
     indexer.index(pdfIndexItem);  

     // close the indexer so that the index is written and can be searched
     indexer.close();  

     // creating the Searcher to the same index location as the Indexer  
     Searcher searcher = new Searcher(INDEX_DIR);  
     List<IndexItem> result = searcher.findByContent("World", DEFAULT_RESULT_SIZE);  
     print(result);  

     searcher.close();  
   }  
    /**  
    * print the results.  
    */  
   private static void print(List<IndexItem> result) {  
     System.out.println("Result Size: " + result.size());  
     for (IndexItem item : result) {  
       System.out.println(item);  
     }  
   }  
 }  
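
The application above indexes a single, hard-coded file. If you want to index every PDF in a folder instead, one possible approach is to collect the files first and pass each of them to PDFIndexer. The helper below is only a sketch: PdfFileCollector and listPdfFiles are hypothetical names that are not part of the sample code, and the helper looks only at files directly inside the given directory (no recursion).

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the sample project.
class PdfFileCollector {

    /**
     * Returns the PDF files located directly inside dir, sorted by path.
     * Returns an empty list if dir does not exist or is not a directory.
     */
    static List<File> listPdfFiles(File dir) {
        File[] matches = dir.listFiles(new FilenameFilter() {
            public boolean accept(File parent, String name) {
                // Match the extension case-insensitively and skip directories.
                return name.toLowerCase(Locale.ROOT).endsWith(".pdf")
                        && new File(parent, name).isFile();
            }
        });

        List<File> result = new ArrayList<File>();
        if (matches != null) { // listFiles() returns null for a missing directory
            Arrays.sort(matches);
            result.addAll(Arrays.asList(matches));
        }
        return result;
    }
}
```

In main(), you could then loop over PdfFileCollector.listPdfFiles(new File("src/main/resources/files")) and call indexer.index(new PDFIndexer().index(file)) for each file before closing the indexer.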

Summary:
This was a brief article on how to integrate Apache PDFBox with Apache Lucene to index the contents of a PDF file.

Sample code can be found here.

2 comments:

  1. Hi Fazlan, great article about Lucene PDF search, but I have a big problem:
    when I use your code to search in a PDF book document, the result printed at the console contains the whole text of the book.

    Replies
    1. In the sample implementation of Searcher.java, I have included the file content as part of each result matched.

      However, you can prevent the file content from being rendered in the results by changing the following block of code from

      for (ScoreDoc scoreDoc : queryResults) {
        Document doc = searcher.doc(scoreDoc.doc);
        results.add(new IndexItem(Long.parseLong(doc.get(IndexItem.ID)), doc.get(IndexItem.TITLE), doc.get(IndexItem.CONTENT)));
      }

      to

      for (ScoreDoc scoreDoc : queryResults) {
        Document doc = searcher.doc(scoreDoc.doc);
        results.add(new IndexItem(Long.parseLong(doc.get(IndexItem.ID)), doc.get(IndexItem.TITLE), ""));
      }

      In other words, simply replace "doc.get(IndexItem.CONTENT)" with an empty string when creating the results.
      That should display ONLY the ID and title of each matching file.

      cheers,
      Fazlan
