Fog Creek Software
Discussion Board

Search engine to be built into J2EE-based product

Our company has a web-based J2EE product into which we need to incorporate document search. Currently there is only a basic text search of the database, but the database contains documents in Word, PDF, and other formats, and those need to be searchable too.
I guess that would require a search engine. The constraints are that it should handle close to 200 known document formats and that it can be integrated completely into our product (no .exe's and such). The documents reside inside the database, not on disk or behind a URL.
I have looked at some of the things available out there, from the Google Toolbar to Jakarta Lucene. If necessary we can do some coding, but we also don't mind an off-the-shelf solution.
It should be possible to bundle it into the product, which uses J2EE and runs on a Windows machine.

Wednesday, April 14, 2004

That's all very interesting, thanks for sharing. Oh, did you mean to include a question about it all?

Sorry, Couldn't Resist
Wednesday, April 14, 2004

Yeah! I forgot to ask the question: has anyone done something similar before?

Wednesday, April 14, 2004

I believe the most widely used solution to index binary formats is Oracle Intermedia, which only works on top of the Oracle database.

But since I guess you're looking for a self-contained solution, my only bet would be Jakarta Lucene, plus some coding and integration of third-party tools to build the filters that convert binary formats to text. I don't know whether these converters are currently provided with Lucene (last time I looked at the package they weren't).
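
To make the "filters plus index" idea concrete, here is a rough sketch of how the pieces might fit together. Everything in it is an assumption for illustration: TextFilter, TinyIndex and Indexer are made-up names, not Lucene classes, and the tiny in-memory index only stands in for a real engine. The point is just the shape: one filter per format turns the raw bytes from the database into plain text, and only that text reaches the index.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical names throughout: none of this is Lucene's API.
class FilterIndexSketch {

    /** Converts the raw bytes of one document format to plain text. */
    interface TextFilter {
        String extractText(byte[] content);
    }

    /** Minimal in-memory inverted index, standing in for a real search engine. */
    static class TinyIndex {
        private final Map<String, Set<Integer>> postings = new HashMap<>();

        void add(int docId, String text) {
            // Lowercase the text and split it on anything that is not a letter or digit.
            for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
                }
            }
        }

        Set<Integer> search(String term) {
            return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
        }
    }

    /** Picks a filter by format name and feeds the extracted text to the index. */
    static class Indexer {
        private final Map<String, TextFilter> filters = new HashMap<>();
        private final TinyIndex index = new TinyIndex();

        void register(String format, TextFilter filter) {
            filters.put(format, filter);
        }

        /** Returns false when no filter is registered for the given format. */
        boolean indexDocument(int docId, String format, byte[] content) {
            TextFilter filter = filters.get(format);
            if (filter == null) {
                return false;
            }
            index.add(docId, filter.extractText(content));
            return true;
        }

        Set<Integer> search(String term) {
            return index.search(term);
        }
    }

    public static void main(String[] args) {
        Indexer indexer = new Indexer();
        // Plain text needs no conversion; a "pdf" or "doc" filter would wrap a third-party parser.
        indexer.register("txt", bytes -> new String(bytes, StandardCharsets.UTF_8));

        // In the real product these bytes would come from the database column holding the document.
        byte[] blob = "Quarterly sales report".getBytes(StandardCharsets.UTF_8);
        indexer.indexDocument(1, "txt", blob);

        System.out.println(indexer.search("sales"));
        System.out.println(indexer.indexDocument(2, "pdf", new byte[0]));
    }
}
```

With this split, supporting another format means registering one more filter, and swapping the toy index for a real engine should not touch the filters at all.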

I'm also interested in a solution that can be easily packaged in shrinkwrap software and handles the most popular file formats (MS Office, PDF, ...).

Wednesday, April 14, 2004

Does Google license any of their PDF/PPT indexing technology?

I know they do the Google appliance, but maybe they also offer some sort of binary-only licence?

Chris Ormerod
Thursday, April 15, 2004
