Java API for converting Word, PDF to text?
Looking for a good Java API which does this.
For Word, check out the POI project at Jakarta. Not sure about PDF - the effort seems to have been put into creating PDFs, not reading them.
In addition to POI (which is still kinda rudimentary for Word), check out textmining.org and pdfbox.org.
For PDF read/write, you may want to take a look at iText.
We're in the process of getting ready to release PDFTextStream, a Java PDF-to-text conversion library that is focussed on (a) speed and (b) accuracy of text output. In our internal testing, we've found it to be faster than all other available PDF-to-text libraries on the market, including iText, PDFBox, and Multivalent. (There are even a couple of cases where it's faster than the native pdftotext utility, which makes it perfect for converting large batches of PDFs very efficiently.) It supports 40- and 128-bit PDF encryption, and includes classes for easy integration with Lucene.
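Since the replies above point to a different library per format (POI or textmining.org for Word; PDFBox, iText, or PDFTextStream for PDF), one possible way to keep your own code independent of any of them is a thin dispatch layer keyed on file extension. The sketch below is only an illustration under assumptions: TextConverter, TextExtractor, register, and convert are hypothetical names made up here, not part of any of the libraries mentioned, and no real library call is shown. The plain-text extractor in main is a stand-in; a "doc" or "pdf" extractor would wrap whichever library you settle on. It assumes Java 9 or later for InputStream.readAllBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class TextConverter {

    /** Hypothetical interface: one implementation per document format. */
    public interface TextExtractor {
        String extract(InputStream in) throws IOException;
    }

    // Maps a lower-cased file extension (without the dot) to its extractor.
    private final Map<String, TextExtractor> extractors = new HashMap<>();

    /** Registers an extractor for an extension such as "doc" or "pdf". */
    public void register(String extension, TextExtractor extractor) {
        extractors.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Picks the extractor by file extension and returns the extracted text. */
    public String convert(String fileName, InputStream in) throws IOException {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextExtractor extractor = extractors.get(ext);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for: " + fileName);
        }
        return extractor.extract(in);
    }

    public static void main(String[] args) throws IOException {
        TextConverter converter = new TextConverter();
        // Stand-in extractor for plain text. A real "doc" or "pdf" extractor
        // would delegate to the library of your choice here instead.
        converter.register("txt", in -> new String(in.readAllBytes(), StandardCharsets.UTF_8));

        InputStream sample = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(converter.convert("notes.TXT", sample));
    }
}
```

With this shape, swapping one PDF library for another only means registering a different extractor under "pdf"; the calling code does not change.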