This article describes how to set up indexing of image files (TIFF, PDF, JPEG, BMP, and so on) using OCR technology. The indexing described below uses Microsoft IFilter technology and, as such, is not specific to SharePoint: it can be used with any product that uses Microsoft indexing, including Microsoft Search, Desktop Search, SQL Server search, and, through plug-ins, Google Desktop Search. I, however, use it with Microsoft Windows SharePoint Services 2003. For the other products, the registration may need to be slightly different.
One of the projects I was working on required storage of old documents scanned into PDF files. Then, there was a separate team of people responsible for providing tags for a search engine so those image documents could be found. The whole process was clumsy, labor intensive, and error prone. That was what started me on my exploration path.
The first search I fired off was for Open Source OCR products. Pretty quickly, I narrowed it down to Tesseract (http://code.google.com/p/tesseract-ocr/). Tesseract is an orphaned brainchild of HP, which worked on it from 1985 to 1995. Then it was released as Open Source, and now, if I understand it correctly, Google is working on it. With credentials like that, it's no wonder that Tesseract scores among the highest marks for OCR accuracy. After downloading it and struggling just a bit, I got Tesseract to work. The struggle was that the home page claims its base input format is TIFF. Maybe my TIFFs were bad, but I was able to get it to work only with BMP files.
Image files conversion
So now that I have an OCR engine that can convert BMP files into text, how do I get text out of image PDF files? One more search, and I settled on ImageMagick (http://www.imagemagick.org/). This is another wonderful Open Source utility that can convert almost any file into an image. It worked out of the box for converting TIFF files into bitmaps, but to convert PDF files, it requires Ghostscript (http://mirror.cs.wisc.edu/pub/mirrors/ghost/GPL/gs864/gs864w32.exe).
Dealing with text PDFs
With that utility installed, I was cooking - I could convert any file (in particular, PDF and TIFF) into a bitmap, and then extract the text out of the bitmap. The only remaining consideration was to treat PDF files that already contain text differently - after all, OCR is very computation intensive, and somewhat error prone even with perfect image quality and resolution. So another quick search, and I have pdftotext (ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl4-win32.zip) - thank God for Open Source! With it, I can pull text out of a PDF in the blink of an eye. I get nothing for pure image PDFs, but I already have a solution for that!
It took another 15 minutes to set up a batch script to automate the process:
- Check the file extension
- If the file is a PDF file:
  - try to extract text out of it
  - if there is more than a certain amount of text in the file - done!
  - if there is no text, convert the first page into a bitmap
  - run OCR on the bitmap
- For any other file type:
  - convert the file into a bitmap
  - run OCR on the bitmap
Once you unzip the attached project, check out the bin\OCR.BAT file. It creates a text file in the directory where your source file is, with the same name plus the '.txt' extension. For example, running it against c:\temp\xyz.pdf will generate the c:\temp\xyz.pdf.txt file.
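The decision logic of the steps above can be sketched in Python. This is only an illustration, not the contents of OCR.BAT: the function names and the 50-character threshold are hypothetical, and the sketch only builds the command lines (pdftotext, ImageMagick's convert with the `[0]` first-page selector, and tesseract) instead of running them.

```python
from pathlib import PureWindowsPath
from typing import List

MIN_TEXT_CHARS = 50  # hypothetical threshold for "enough text to skip OCR"


def output_path(source: str) -> str:
    """Output file: same folder and name as the source, plus '.txt'."""
    return source + ".txt"


def is_pdf(source: str) -> bool:
    """Check the file extension, ignoring case."""
    return PureWindowsPath(source).suffix.lower() == ".pdf"


def has_enough_text(text: str, min_chars: int = MIN_TEXT_CHARS) -> bool:
    """True when pdftotext returned enough text that OCR is not needed."""
    return len(text.strip()) >= min_chars


def extract_command(source: str) -> List[str]:
    """pdftotext call that writes the embedded PDF text to the output file."""
    return ["pdftotext", source, output_path(source)]


def ocr_commands(source: str) -> List[List[str]]:
    """Convert the first page (or frame) to a bitmap, then OCR the bitmap."""
    bitmap = source + ".bmp"
    return [
        ["convert", source + "[0]", bitmap],
        # tesseract appends ".txt" to the output base name on its own,
        # so passing the source path yields <source>.txt
        ["tesseract", bitmap, source],
    ]
```

A caller would run `extract_command` first for PDF files, read the result, and fall back to `ocr_commands` only when `has_enough_text` returns False; every other file type goes straight to `ocr_commands`.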
So now I have a simple batch process to extract text out of any image and/or PDF file. To make it usable in SharePoint (or any other product that uses Microsoft Indexing technology), I need to create an IFilter component. This is a plug-in that Microsoft Indexing uses to search for specialized file formats (e.g., Office, PDF, ...).
Here, there was a right way and a quick way. And I have to admit my guilt - I chose the quick way. The thing is that all the components I use have C/C++ APIs, and to do it right, I should pull everything together and create a single component. Instead, I decided to just run the batch process I set up earlier. This is somewhat slower, but at least I don't have to worry about memory leaks and page faults from code I'm not familiar with.
So I downloaded the Microsoft Platform SDK, got SmpFilt to work, changed GUIDs, got it to run my OCR.BAT - and here you have it - my own OCR plug-in to Microsoft Indexing.
I'm skipping over the pain and sweat of debugging an IFilter, converting between multi-byte and single-byte strings, and all the fun that made Microsoft COM development so "loved" around the world. But the purpose of the article is not to teach how to do COM in C++ or how to develop an IFilter.
Once you have your filter done and registered, the Platform SDK contains two utilities to test IFilter: filtdump.exe and filtreg.exe - you can play with them to make sure your filter is registered and works correctly.
The Microsoft IFilter template will do the appropriate registration for the Indexing Service, but SharePoint requires additional entries. In the download, there is a bin\wss_reg.reg file that will register the SharePoint related entries. I would encourage you, however, to create a backup of the HKLM\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0 key before you import the wss_reg.reg file - just in case, you know.
By the way, since I don't have an installer, the DLL (OcrFilt.dll) also needs to be registered manually.
- I use WSS SP2. If your version of SharePoint is different, your WSS registration entries may be different. Please check whether you have the HKLM\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Applications\e5ecafdd-0ed4-42fa-b663-c38046ae5ec8 key. If not, then your wss_reg.reg file may need to be updated.
- SharePoint stores a numbered list of the extensions it is able to process in HKLM\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Applications\e5ecafdd-0ed4-42fa-b663-c38046ae5ec8\Gather\Search\Extensions\ExtensionList. The entry for PDF should not be there yet, and it needs to be registered with the next unique number. Most of the time, that should be 38, so please make sure that your existing numbering goes up to 37. If there is anything fishy, please review and update wss_reg.reg as needed.
- After you install the filter, you need to re-index the existing contents or remove the files and add them back. Until indexing is complete, you will not be able to find your entries. OcrFilt.dll creates an entry in the Event Log for each file it indexes, so you can follow the progress. I had to recycle the service and then use stsadm. Removing and adding files back also works.
- For performance reasons, only the first page of the PDF/TIFF file is OCR-ed. There are additional ImageMagick utilities to combine multiple images together before OCR-ing if you want to OCR the whole document.
- OCR.BAT will try to create a text file in the same folder your input image is in. As such, the indexing process should have appropriate privileges to that folder. Since this is where WSS creates the temp file, it should not be a problem; however, since rights issues are so difficult to troubleshoot, it's something to keep in mind.
- Even though you can OCR any image type, the IFilter registers only the PDF and TIFF extensions. If you want to process other file types, the OcrFilt.dll registration part will need to be modified.
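The ExtensionList check described in the notes above can be sketched as a small helper. This is an assumption-laden illustration: `plan_registration` is a hypothetical name, and it works on a plain dictionary of number-to-extension pairs that you would first read from the registry key yourself (for example, with Python's `winreg` module on Windows). It reports the next free number and any gaps in the existing numbering, and refuses to continue if the extension is already listed.

```python
from typing import Dict, List, Tuple


def plan_registration(entries: Dict[int, str], extension: str) -> Tuple[int, List[int]]:
    """Return (next free number, gaps in the existing numbering).

    `entries` maps each ExtensionList value number to its extension.
    Raises ValueError when the extension is already registered.
    """
    wanted = extension.lower().lstrip(".")
    existing = {value.lower().lstrip(".") for value in entries.values()}
    if wanted in existing:
        raise ValueError(f"'{wanted}' is already in the extension list")

    numbers = sorted(entries)
    if not numbers:
        # Assumption: an empty list starts numbering at 1.
        return 1, []

    # Numbers missing between the lowest and highest existing entries.
    gaps = [n for n in range(numbers[0], numbers[-1]) if n not in entries]
    return numbers[-1] + 1, gaps
```

With entries numbered 1 through 37 and no PDF among them, the helper returns 38 and an empty gap list, which matches the common case described above. A non-empty gap list is the "something fishy" signal to review the .reg file by hand.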
Even though currently I'm using it only with SharePoint, there are other very interesting applications for this solution:
- Configure the IFilter as a plug-in for SQL Server, to allow indexing of PDF files stored in BLOB columns.
- Structured documents. The ImageMagick convert utility that I use can extract part of an image. It would be pretty easy to change the batch file to extract, for example, the portion of a scanned bill that contains the name and date, to organize filing in the billing department.
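As a rough sketch of the structured-document idea, the helper below builds an ImageMagick command line that cuts a rectangular region out of the first page before OCR. It relies on convert's `-crop WIDTHxHEIGHT+X+Y` geometry and `+repage` options; the function name, file names, and pixel coordinates are hypothetical and would depend on the layout and resolution of your scans.

```python
from typing import List


def crop_command(source: str, output: str, x: int, y: int,
                 width: int, height: int) -> List[str]:
    """Build a convert call that extracts one region of the first page."""
    if width <= 0 or height <= 0 or x < 0 or y < 0:
        raise ValueError("region needs a positive size and a non-negative offset")
    geometry = f"{width}x{height}+{x}+{y}"
    # +repage resets the canvas so the output is just the cropped region
    return ["convert", source + "[0]", "-crop", geometry, "+repage", output]
```

For example, `crop_command("bill.tif", "bill_header.bmp", 100, 50, 800, 200)` describes an 800 by 200 pixel region starting 100 pixels from the left and 50 pixels from the top; the resulting bitmap can then be passed to the OCR step.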
Even though all the components are Open Source, you might want to verify that your company's legal department has no problem with each component's licensing requirements.
There is no better way to show your support than to donate money. The second best thing is to vote for the article you like :-)