Fog Creek Software
Discussion Board




Java API for converting Word, PDF to text?

Looking for a good Java API which does this.
It has to be Java 1.4 compatible. Multivalent is good, but it runs on 1.5.

Anon
Thursday, April 22, 2004

For Word, check out the POI project at Jakarta.  Not sure about PDF; the effort seems to have been put into creating PDFs, not reading them.


Thursday, April 22, 2004

In addition to POI (which is still kinda rudimentary for Word), check out textmining.org and pdfbox.org.

Brian
Thursday, April 22, 2004
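
Since Word and PDF end up being handled by different libraries, one way to keep the calling code simple is to hide each library behind a small common interface and pick the implementation by file extension. The following is only a rough sketch of that idea: `TextExtractor`, `ExtractorRegistry`, and `ExtractorDemo` are hypothetical names, not part of POI, PDFBox, or any other library mentioned here, and the extractors in `main` are stubs standing in for real library calls. It is written for a current JDK; on Java 1.4 you would have to drop the generics and cast.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical interface: one implementation per document format. */
interface TextExtractor {
    String extract(byte[] data) throws Exception;
}

/** Hypothetical registry that picks an extractor by file extension. */
class ExtractorRegistry {
    private final Map<String, TextExtractor> byExtension =
            new HashMap<String, TextExtractor>();

    /** Registers an extractor for an extension such as "doc" or "pdf". */
    void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Dispatches to the extractor registered for the file's extension. */
    String extract(String fileName, byte[] data) throws Exception {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            throw new IllegalArgumentException("No file extension: " + fileName);
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextExtractor extractor = byExtension.get(ext);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for ." + ext);
        }
        return extractor.extract(data);
    }
}

class ExtractorDemo {
    public static void main(String[] args) throws Exception {
        ExtractorRegistry registry = new ExtractorRegistry();

        // Stub only: a real implementation would call a Word library here.
        registry.register("doc", new TextExtractor() {
            public String extract(byte[] data) {
                return "[doc stub] " + new String(data, StandardCharsets.UTF_8);
            }
        });

        // Stub only: a real implementation would call a PDF library here.
        registry.register("pdf", new TextExtractor() {
            public String extract(byte[] data) {
                return "[pdf stub] " + new String(data, StandardCharsets.UTF_8);
            }
        });

        byte[] sample = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.extract("memo.DOC", sample));
        System.out.println(registry.extract("report.pdf", sample));
    }
}
```

The point of the indirection is that swapping one PDF library for another later only touches the class registered for "pdf", not the code that calls `extract`.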

For PDF read/write, you may want to take a look at iText.

http://www.lowagie.com/iText/

A.F.

Avrom Finkelstein
Thursday, April 22, 2004

We're getting ready to release PDFTextStream, a Java PDF-to-text conversion library that is focused on (a) speed and (b) accuracy of text output.  In our internal testing, we've found it to be faster than all other available PDF-to-text libraries on the market, including iText, PDFBox, and Multivalent.  (There are even a couple of cases where it's faster than the native pdftotext utility, which makes it well suited to converting large batches of PDFs efficiently.)  It supports 40- and 128-bit PDF encryption, and includes classes for easy integration with Lucene.

If anyone is interested in becoming a beta tester, please feel free to email me.

Chas Emerick
Tuesday, April 27, 2004
