Fog Creek Software
Discussion Board


I have some information pages on houses from my realtor. We printed them from a computer straight to PDF and the quality is excellent. My goal is to OCR them and extract the text so that I can put it in a spreadsheet to use with MapPoint.

There is one listing per page, and each page has a single picture in the same place. Each listing has a block of text at the top that I want to extract 5 columns of data from. Each data element is preceded by the element name, like "Address: 223 Park Ave." There is also a small amount of additional data in table form below the main data that would be nice to capture.

I am looking for a program to OCR the data and then extract that data to a spreadsheet. The picture should be extracted to a separate TIFF/JPG file. The spreadsheet should include a link to the picture.

I used OmniPage Pro and it seemed to do a good job of OCR'ing the data and the picture, but it doesn't seem to have the power to pull out the address as the text after "Address:", and so on for the other fields.

Is there one program that will do this process start to finish, or is there something I can use to read the OmniPage output to pull out the picture and the text data?

Rich Zellmer
Wednesday, August 04, 2004


I gotta tell you (coming from the OCR and DocEx world), with the limited number of items that you're dealing with, you want to type them in by hand.

Unless you're working on more than a thousand pages, there won't be any net benefit to scanning them and OCR'ing them.

First, you're not going to get 100% accuracy with the text conversion, so there will be some error. The OCR will undoubtedly have more problems on addresses, where dictionary words aren't guaranteed and there is a mix of numbers and characters. For mapping purposes, you will have to verify that these are correct anyway.

Second, scanning in the remaining items and dropping them into a spreadsheet or a database presents zoning problems. Since each zone is variable (not guaranteed to be in the same spot due to text lengths and spacing concerns), you'll need an adaptive zoning algorithm. This is one area where OCR software is far from accurate. Your pictures will most likely not be identified as pictures more than 60% of the time. The captured picture also generally includes too much (surrounding text) or too little (a cropped picture).

The best commercial software out there now isn't up to doing what you want it to do.  Even using the ScanSoft SDK to custom write your own zoning algorithm will likely produce errors.

Best of luck to you, and I hope your fingers don't cramp with all that typing, but it is still faster, easier, and less error-prone than OCR'ing it.

Wednesday, August 04, 2004

If they're in PDF, you don't need OCR. You need pdftotext, which is one of the Xpdf tools.

Image extraction can also be accomplished using Xpdf.
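
As a rough illustration of that route, here is a minimal Python sketch, assuming pdftotext is installed and on the PATH and that the PDF has a real text layer. The field labels and file names are made up for the example (only "Address" comes from the original question), and the regular expression assumes each value sits on the same line as its label.

```python
import csv
import re
import subprocess

# Hypothetical labels; replace with the ones printed on the listing pages.
FIELDS = ["Address", "Price", "Bedrooms", "Bathrooms", "Lot Size"]


def pdf_to_pages(pdf_path):
    """Run pdftotext and return the text of each page as a list of strings."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True, text=True, check=True,
    )
    # pdftotext ends each page with a form feed character.
    return [page for page in result.stdout.split("\f") if page.strip()]


def parse_listing(page_text, fields=FIELDS):
    """Pull 'Label: value' pairs out of the text of one listing page."""
    row = {}
    for name in fields:
        # The value runs to the end of the line, or stops at a gap of two or
        # more spaces, which in -layout output usually separates columns.
        pattern = rf"{re.escape(name)}:[ \t]*(.*?)(?:[ \t]{{2,}}|$)"
        match = re.search(pattern, page_text, re.MULTILINE)
        row[name] = match.group(1).strip() if match else ""
    return row


def write_csv(rows, csv_path, fields=FIELDS):
    """Write one spreadsheet row per listing."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)


# Example use (file names are hypothetical):
#   rows = [parse_listing(page) for page in pdf_to_pages("listings.pdf")]
#   write_csv(rows, "listings.csv")
```

For the photos, the Xpdf image tool is pdfimages (for example, pdfimages -j listings.pdf photo). Matching each extracted file to its spreadsheet row is left out of this sketch, and any addresses pulled this way should still be checked by eye before mapping.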

Wednesday, August 04, 2004

Agreed; several of our clients have IBM Content Manager and Ascent Capture rigged together to scan in content, but they have 500,000+ documents in there. Unless you have a pretty good volume, you might as well hire data entry staff to get the info in. It will be more cost-efficient. Once the data's in there, if you can switch to electronic-only entry (i.e., online forms), you'll never even have to think about OCR again.

Wednesday, August 04, 2004

OCR. There is a case of overpromise and underdeliver if ever there was one. By the time you've fiddled around scanning and OCR'ing and Excel'ing, you could have walked to the properties and looked at them. Give it up; the PC is not the center of the universe. Get off your knees and quit worshipping useless technology that robs you of your time.

Wednesday, August 04, 2004


That's being a little cynical, not to mention that you didn't really offer anything constructive to someone who probably just wants to perform trend analysis of properties on a features/location/price basis.

Ease up and take a yoga class or something.

Wednesday, August 04, 2004

It doesn't appear any scanning is needed. Remember the OP said he printed out the .pdf on his computer.

Stephen Jones
Wednesday, August 04, 2004

If the PDF is text-based, he can just cut and paste directly from the PDF. If the PDF is a scanned image, he'll need to OCR it.

By the way, sysadmin's message was correct in content, though the delivery was over the top.
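
One quick way to tell the two cases apart is to run pdftotext and see whether anything readable comes back. The sketch below assumes pdftotext is installed. The 20-character threshold and the file name are arbitrary choices for illustration, not anything from the tools themselves.

```python
import subprocess


def extract_text(pdf_path):
    """Return whatever text pdftotext can pull out of the PDF."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def has_text_layer(text, min_chars=20):
    """Guess that the PDF is text-based if enough non-blank characters came out.

    min_chars is an arbitrary cutoff; a scanned-image PDF typically yields
    little more than page-break characters.
    """
    visible = sum(1 for ch in text if not ch.isspace())
    return visible >= min_chars


# Example use (file name is hypothetical):
#   if has_text_layer(extract_text("listings.pdf")):
#       print("Text-based: copy/paste or pdftotext will work")
#   else:
#       print("Probably a scanned image: OCR needed")
```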

Wednesday, August 04, 2004
