Fog Creek Software
Discussion Board




Parser to convert various file formats to text

Is there any generic parser that handles the 200 or so document formats supported by Google and converts them to text? It does not have to be all 200; a good subset will do.
I have to build an indexing engine for document search, and unless all documents are converted to text, they can't be searched.
Any suggestions? I don't want to write a parser for each format. I will probably be using Lucene for the indexing and searching.

Anon
Friday, April 16, 2004

[caveat - you didn't mention platform - these are Microsoft solutions]

SharePoint!!!

[grin - come on, that was a gimme]

If SharePoint is overkill, check out Indexing Service, which can index many file types natively, and you can add iFilters for other file types (Adobe provides a PDF iFilter for free).

If you're on Linux, I'm sure there's an equivalent there - I'd look for one that uses iFilters, since many of those are already extant.

Philo

Philo
Friday, April 16, 2004

Assuming you only want to index the text and search for the original document, a quick and dirty solution is to run 'strings' to extract all text strings longer than 'n' chars and index them.
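
For anyone who wants to try this without the 'strings' binary, here is a rough Python sketch of the same idea. The function name extract_strings and the default of 4 characters are my own choices for illustration, not anything standard, and it only finds runs of printable ASCII, so UTF-16 text or compressed streams inside a file will be missed.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return every run of printable ASCII in data that is at least min_len bytes long.

    This only approximates what the Unix 'strings' tool does; it is a
    hypothetical helper, not a replacement for a real format parser.
    """
    # Bytes 0x20 (space) through 0x7e (~) are the printable ASCII range.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


if __name__ == "__main__":
    # Fake "binary document": text fragments separated by non-printable bytes.
    sample = b"\x00\x01Hello world\x00\xff\x02abc\x00Lucene index\x7f"
    for text in extract_strings(sample):
        print(text)
```

The extracted strings could then be fed to the indexer as the document body, with the original file path stored alongside so a search hit leads back to the real document.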

Martin Beckett
Saturday, April 17, 2004
