Fazlan's Blog Spot: Apache Lucene Tutorial: Indexing PDF Files

Overview:
This article is a sequel to Apache Lucene Tutorial: Lucene for Text Search. Here, we look at how to index the content of a PDF file. Apache Lucene has no built-in capability to process PDF files, so we need a library that can extract the text from them.

One such library is Apache PDFBox, which we'll use in this article. You can read more about Apache PDFBox.

This article applies to Lucene 3.6.0 and PDFBox 0.7.3.

Sample code can be found here.

You may also refer to Apache Lucene Tutorial: Indexing Microsoft Documents.

Project Structure:

 org.fazlan.lucene.pdf.demo  
 |-- pom.xml  
 `-- src  
   `-- main  
     |-- java  
     |  `-- org  
     |    `-- fazlan  
     |      `-- lucene  
     |        `-- demo  
     |          |-- FileIndexApplication.java  
     |          |-- FileIndexer.java  
     |          |-- Indexer.java  
     |          |-- IndexItem.java  
     |          |-- PDFIndexer.java  
     |          `-- Searcher.java  
     `-- resources  
       `-- files  
         `-- HelloPDFBox.pdf

Step 1: Creating the Project

 mvn archetype:generate -DartifactId=org.fazlan.lucene.pdf.demo -DgroupId=org.fazlan -Dversion=1.0-SNAPSHOT -DinteractiveMode=false

Step 2: Updating the Maven Dependencies

 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">  
  <modelVersion>4.0.0</modelVersion>  
  <groupId>org.fazlan</groupId>  
  <artifactId>org.fazlan.lucene.pdf.demo</artifactId>  
  <packaging>jar</packaging>  
  <version>1.0-SNAPSHOT</version>  
  <name>org.fazlan.lucene.demo</name>  
  <url>http://maven.apache.org</url>  
  <dependencies>  
   <dependency>  
      <groupId>org.apache.lucene</groupId>  
      <artifactId>lucene-core</artifactId>  
      <version>3.6.0</version>  
   </dependency>  
    <dependency>  
      <groupId>pdfbox</groupId>  
      <artifactId>pdfbox</artifactId>  
      <version>0.7.3</version>  
    </dependency>  
   <dependency>  
    <groupId>junit</groupId>  
    <artifactId>junit</artifactId>  
    <version>3.8.1</version>  
    <scope>test</scope>  
   </dependency>  
  </dependencies>  
 </project>

Step 3: Defining the PDF Indexer

This is the most important component. The following code loads a PDF file and extracts its content as a String, which Lucene can then process for indexing.

 package org.fazlan.lucene.demo;

 import org.pdfbox.pdmodel.PDDocument;
 import org.pdfbox.util.PDFTextStripper;
 import java.io.File;
 import java.io.IOException;

 public class PDFIndexer implements FileIndexer {

   public IndexItem index(File file) throws IOException {
     PDDocument doc = PDDocument.load(file);
     try {
       // extract the text of the whole document as a single String
       String content = new PDFTextStripper().getText(doc);
       // the hash of the file name serves as a simple id; the file name is the title
       return new IndexItem((long) file.getName().hashCode(), file.getName(), content);
     } finally {
       // always release the document, even if text extraction fails
       doc.close();
     }
   }
 }
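
PDFIndexer relies on two types that come from the sample code rather than from this article: the FileIndexer interface and the IndexItem class. The following is a minimal sketch of what they might look like, inferred only from how they are used above. The field names, the getters, and PlainTextIndexer (a stand-in that needs no PDF library) are assumptions, not the actual sample code.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical value object holding what gets indexed for one file.
class IndexItem {
    private final Long id;
    private final String title;
    private final String content;

    IndexItem(Long id, String title, String content) {
        this.id = id;
        this.title = title;
        this.content = content;
    }

    Long getId() { return id; }
    String getTitle() { return title; }
    String getContent() { return content; }

    @Override
    public String toString() {
        return "IndexItem{id=" + id + ", title='" + title + "'}";
    }
}

// Assumed contract: turn one file into one IndexItem.
interface FileIndexer {
    IndexItem index(File file) throws IOException;
}

// Stand-in implementation for plain-text files, mirroring the shape of PDFIndexer.
class PlainTextIndexer implements FileIndexer {
    public IndexItem index(File file) throws IOException {
        // read the whole file as UTF-8 text
        String content = new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
        // same id and title scheme as PDFIndexer: hash of the file name, then the file name
        return new IndexItem((long) file.getName().hashCode(), file.getName(), content);
    }
}
```

Because every indexer returns an IndexItem, the rest of the application does not need to know which file format the content came from.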

Step 4: Writing an Application

The following sample application indexes a PDF file and then searches the index for a term.

 package org.fazlan.lucene.demo;  

 import org.apache.lucene.queryParser.ParseException;  
 import java.io.File;  
 import java.io.IOException;  
 import java.util.List;  

 public class FileIndexApplication {  

   // location where the index will be stored.  
   private static final String INDEX_DIR = "src/main/resources/index";  
   private static final int DEFAULT_RESULT_SIZE = 100;  

   public static void main(String[] args) throws IOException, ParseException {  

     File pdfFile = new File("src/main/resources/files/HelloPDFBox.pdf");  
     IndexItem pdfIndexItem = new PDFIndexer().index(pdfFile);  
     // creating the indexer and indexing the items  

     Indexer indexer = new Indexer(INDEX_DIR);  
     indexer.index(pdfIndexItem);  

     // close the indexer so that the indexed items are written and visible to the Searcher  
     indexer.close();  

     // creating the Searcher to the same index location as the Indexer  
     Searcher searcher = new Searcher(INDEX_DIR);  
     List<IndexItem> result = searcher.findByContent("World", DEFAULT_RESULT_SIZE);  
     print(result);  

     searcher.close();  
   }  
    /**  
    * print the results.  
    */  
   private static void print(List<IndexItem> result) {  
     System.out.println("Result Size: " + result.size());  
     for (IndexItem item : result) {  
       System.out.println(item);  
     }  
   }  
 }
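
The application above indexes a single hard-coded file. To index every PDF under a directory, one option is to collect the files first and then pass each one to the PDF indexer. The sketch below shows only the collection step and uses the standard library alone; PdfFileFinder and findPdfFiles are hypothetical names that do not appear in the sample code.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class PdfFileFinder {

    // Recursively collects all files under dir whose name ends with ".pdf" (case-insensitive).
    static List<File> findPdfFiles(File dir) {
        List<File> result = new ArrayList<File>();
        File[] children = dir.listFiles();
        if (children == null) {
            // dir is not a directory, or it could not be read
            return result;
        }
        for (File child : children) {
            if (child.isDirectory()) {
                result.addAll(findPdfFiles(child));
            } else if (child.getName().toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                result.add(child);
            }
        }
        // sort by path so the indexing order is predictable
        Collections.sort(result);
        return result;
    }
}
```

Each returned file could then be passed to new PDFIndexer().index(file), and the resulting item to indexer.index(...), in the same way main handles HelloPDFBox.pdf.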

Summary:
This was a brief article on how to integrate Apache PDFBox with Apache Lucene to index the contents of a PDF file.

Sample code can be found here.

Thursday, June 7, 2012
