Fog Creek Software
Discussion Board

OCR from PDF file

I have a printed address list that I want to import into Excel. The list is too long for me to type, so I want to scan and OCR it.

The list is also too long for me to scan by hand on my flatbed at home, so I'll go to a print shop and pay them to scan it. I can pay 20 cents per page to simply have them scan it to a PDF file (presumably as graphics), or I can pay three times that to have them scan and OCR it. I can't afford that.

So I'm thinking I'll pay the lower cost, and then take the resulting PDF file home and use OCR software on that. So now I'm looking for the cheapest reliable OCR software that can use PDF files as input.

Any suggestions?

Tuesday, February 3, 2004

Is there a reason they scan it to a PDF file?  It would make more sense just to scan it to a bitmap format, like TIFF or JPEG, which almost any OCR software should support.

Robert Jacobson
Tuesday, February 3, 2004

There's an Acrobat plug-in that can do OCR on Acrobat files, but all that gives you is a more searchable PDF; it doesn't extract the actual text. Plus you need to buy the full version of Acrobat, and you've already said you're cheap.

Scanning into a graphics format is a much better solution.

Chris Tavares
Tuesday, February 3, 2004

Is there some reason you can't scan the list in chunks? If you have a 40-inch tall list that won't fit on your scanner, I would think you can just fold it in thirds and make three scans.

Caliban Tiresias Darklock
Tuesday, February 3, 2004

ScanSoft's SDK supports it, so I imagine OmniPage 14 or whatever supports it.  $99 and it's the best thing that's out there.  In fact, it's a really cool feature.  It has the ability to create a new PDF, replacing the existing text with similar fonts and leaving the original image untouched for unrecognizable characters.

Tuesday, February 3, 2004

I specifically remember having an OCR plugin once that actually gave out real text, but the 30 days expired and here I am.

I don't recall the name.

Tuesday, February 3, 2004

I think getting them to OCR it for you is the best bet.  Shop around, you may find cheaper places.

Otherwise it is going to be a HUGE hassle.

Besides, you still have to verify the correctness of the OCR output.

Tuesday, February 3, 2004

Why not pay some University student to type it up for you?  It'd be pretty cheap and not too hard to organise.

Tuesday, February 3, 2004

You should've paid for the BULKPLUS package, which has the addresses on CD.

Tuesday, February 3, 2004

Caliban Tiresias Darklock: When I said "it's a long list", I didn't mean a single long page. I meant 500 pages. And I don't want to do 500 scans.

Robert Jacobson: I'll ask them if they can scan to TIFF files. I'm guessing the reason they don't is related to the fact that a PDF can contain multiple pages, whereas they'd have to create a separate TIFF file for each page. Still, I'd think professional software could create the 500 files and autonumber them.

Elephant: Actually, OmniPage is the software I have, albeit the "special edition" that came with my scanner. I couldn't verify that it reads PDF files -- it lists them as a file type but couldn't read the file. This may be a disabled function in the "special edition". The retail version is $150 or $200, depending on which version you buy. Can you tell me where you found it for $100?

DJ: I may end up taking your advice and having them OCR it. Saves me a ton of work, and then I have someone to blame if the OCR sucks.

Tuesday, February 3, 2004

>I may end up taking your advice and having them OCR it. Saves me a ton of work, and then I have someone to blame if the OCR sucks.

The OCR output will always suck. I think the original suggestion implied that having someone else do the OCR means you'll only have to proofread instead of having to do the OCR and proofread.

You can convert PDF files to image files, btw. Search on Google.
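One way to do that conversion, as a rough sketch rather than a tested recipe: Ghostscript can render each page of a PDF to its own numbered TIFF, which most OCR packages will read. The file names, the 300 dpi resolution, and the helper names `build_gs_command` and `pdf_to_tiffs` below are illustrative assumptions, and the script assumes the Ghostscript executable is on the path as `gs`.

```python
import subprocess


def build_gs_command(pdf_path, out_pattern="page-%03d.tif", dpi=300):
    """Build the Ghostscript argument list that renders every PDF page to a TIFF."""
    return [
        "gs",                           # assumed name; on Windows it is usually gswin32c or gswin64c
        "-dNOPAUSE",                    # do not pause between pages
        "-dBATCH",                      # exit when the last page is done
        "-sDEVICE=tiffg4",              # black-and-white, fax-compressed TIFF; fine for text pages
        f"-r{dpi}",                     # rendering resolution in dots per inch
        f"-sOutputFile={out_pattern}",  # Ghostscript replaces %03d with the page number
        pdf_path,
    ]


def pdf_to_tiffs(pdf_path, out_pattern="page-%03d.tif", dpi=300):
    """Run Ghostscript; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_gs_command(pdf_path, out_pattern, dpi), check=True)
```

Calling `pdf_to_tiffs("addresses.pdf")` (a made-up file name) should leave page-001.tif, page-002.tif and so on in the current directory, ready to feed to whatever OCR program you have.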

Wednesday, February 4, 2004

Download Ghostscript (free) and use its 'extract text' function.

Wednesday, February 4, 2004

Ghostscript will only extract text encoded as text, not graphics.

Ged Byrne
Wednesday, February 4, 2004

View the PDF file and take screenshots of each page. Apply OCR to taste.

Wednesday, February 4, 2004

What I ended up doing was paying them to OCR it. They finally quoted me a much lower price -- rather than three times the price per page, it is just a flat $25 additional. So I paid it.
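For anyone checking the arithmetic, here is a quick sketch using the figures from this thread: 500 pages, 20 cents per page to scan, three times that for scan plus OCR, or a flat $25 on top of the scan-only price. It assumes the whole list was billed at the quoted per-page rate; the variable names are just for illustration.

```python
PAGES = 500                # length of the list, as stated earlier in the thread
SCAN_CENTS_PER_PAGE = 20   # quoted price for scanning to PDF

# Work in whole cents to avoid floating-point rounding.
scan_only = PAGES * SCAN_CENTS_PER_PAGE          # scan to PDF, no OCR
ocr_per_page = PAGES * SCAN_CENTS_PER_PAGE * 3   # original quote: three times the per-page price
ocr_flat = scan_only + 25 * 100                  # later quote: flat $25 additional (assumed on top of scan-only)


def dollars(cents):
    """Format a whole number of cents as a dollar amount."""
    return f"${cents / 100:.2f}"


print("Scan only:        ", dollars(scan_only))     # $100.00
print("OCR, per-page:    ", dollars(ocr_per_page))  # $300.00
print("OCR, flat $25 fee:", dollars(ocr_flat))      # $125.00
```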

Guess what? Their OCR software sucks. So there's a lot of manual cleanup required. But it was still cheaper than $150 for a full copy of OmniPage, which I'm not likely to need in the future since the SE version does most of what I want.

Thanks for all your help and suggestions!

Friday, February 6, 2004

Zahid, I emailed you and offered to do it for free.  Seems that would still be better than a $25 flat fee.

Friday, February 6, 2004
