Fog Creek Software
Discussion Board




Reading different file formats?

I have different file formats - all the Microsoft formats, PDF, PS, CSV, and a whole lot of others - around 300 of them, which our software is supposed to support.
We have to search through all those document formats, which requires indexing the documents, which means extracting the text from them.
There are some 3rd-party solutions which say they can read and index up to 150 document formats. Does that mean they wrote a parser for each and every one of those formats?
I think that's not possible, but is there any way, given a binary stream of data in any one of those formats, to extract the text from the stream for indexing?
Or do we have to know the internal layout of all those formats and write a parser for each?
I really doubt that the 3rd-party software people wrote 150 different parsers for 150 different formats.

Challenged
Thursday, April 29, 2004

First, to address the question: in most formats, text is text, so if the goal of the exercise is, given a certain word, to pull up all documents that contain it, you should be fine.  You'll just have to build your parser so that it filters out non-ASCII byte sequences, and then index the rest.

My question is, why are you doing this?  What benefit is there in indexing 300 different formats vs. 150?  I have to say, of the document types that contain text, I have maybe 10 different varieties.  In fact most people only use a couple of programs for document generation.  Even if I were using an indexing program that indexed 90% of the formats I used, I'd still be a happy customer.  Mostly, if it didn't index document type X, I'd probably stop using that program if I could, or realize that if nothing is returned, I should go manually look through those documents.  Either way, still plenty of time saved.  I guess I really don't see any motivation for being able to index 300 document types over 150.  If all you need is indexing capability (and the indexing program itself isn't the final product), why not use the 3rd-party app instead of rolling your own?

Elephant
Thursday, April 29, 2004

If I were going about that task, I would break the documents into a set of attributes that define how each format is structured - e.g., whether a document type uses tags (XML, HTML) or control characters (RTF, AFAIK). Then I'd create common text-extraction methods for all the different types of documents.

Then I would create an index of the document filenames, and set what methods they use. There may be some that need customizing, but the majority would probably be ok.

That was just the first thing that came to mind though, probably not the best solution.

Paul
Thursday, April 29, 2004

Unix installations typically have a program called "strings" that can filter any file and extract sequences of letters that are at least 4 characters long.  Of course, these sequences are often not words, so you may want to combine that approach with a dictionary to separate the wheat from the chaff...
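That "strings" idea can be sketched in a few lines of Python (a rough illustration, not the real strings implementation; the dictionary filter and its 50% threshold are my own assumptions):

```python
import re

def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    # Like Unix "strings": runs of printable ASCII (space through tilde)
    # at least min_len characters long.
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]

def likely_words(strings: list[str], dictionary: set[str]) -> list[str]:
    # Separate the wheat from the chaff: keep only strings where most
    # of the alphabetic words appear in the dictionary.
    out = []
    for s in strings:
        words = re.findall(r"[a-z]+", s.lower())
        if words and sum(w in dictionary for w in words) / len(words) > 0.5:
            out.append(s)
    return out
```

On a plain-text or lightly structured file this works surprisingly well; on compressed or encoded formats it returns mostly garbage, which is exactly the caveat below.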

You are also out of luck if the file uses some sort of encoding that masks the letters somehow, or intersperses the letters with other data.  I've seen a lot of generated PostScript files that do this, for example...

joev
Thursday, April 29, 2004

Search on "IFilter"

Philo

Philo
Thursday, April 29, 2004

Why are you re-inventing the wheel? Buy dtSearch. It does all you want, including indexing and returning text streams for tons of document formats.

Brad Wilson (dotnetguy.techieswithcats.com)
Thursday, April 29, 2004

The users put documents into our system, and we cannot control whether they include metadata or not.
dtSearch might work for some file formats, but our system supports close to 350 file formats, and we need the ability to search at least 80-90% of them.

Challenged
Thursday, April 29, 2004

You folks that think you can dump a .doc or .pdf file using strings are truly naive.  Dump a doc or pdf file sometime (with "od -c" for instance); chances are you won't get a single recognizable string.

5v3n
Thursday, April 29, 2004

"I really doubt that 3rd party software people wrote 150 different parsers for 150 different formats."

Ahhh, the old argument that because you can't see yourself doing something, it must be impossible.

How many formats does Paint Shop Pro understand?  They use the file extension to make a first guess, then read the first 50 bytes or so and examine them (for instance, the presence of the "GIF87" string is a strong clue that the file is a GIF).  You'd do the same thing in your parser.
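That magic-number sniffing looks something like this in Python (the signature table is a small illustrative subset, not what Paint Shop Pro actually ships):

```python
# A few well-known file signatures ("magic numbers") at offset 0.
MAGIC_SIGNATURES = [
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"PK\x03\x04", "zip"),        # also Office OOXML containers
    (b"\xd0\xcf\x11\xe0", "ole2"), # legacy MS Office compound files
]

def sniff_format(header: bytes):
    # Check the first bytes of the file against each known signature;
    # return the format name, or None if nothing matches.
    for magic, name in MAGIC_SIGNATURES:
        if header.startswith(magic):
            return name
    return None
```

In practice you'd check the extension first for speed, then confirm with the header, since extensions lie.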

5v3n
Thursday, April 29, 2004

5v3n, can you suggest any way of doing this, that is, extracting actual text from binary streams?

Challenged
Thursday, April 29, 2004

5v3n:

You are quite wrong.  I just took a random Microsoft Word document available on the web (so you can play along at home) at:

http://www.microsoft.com/smserver/evaluation/casestudies/casestudy.asp?CaseStudyID=14999

The document is significantly complex with layouts and background images, and all sorts of marketing goodies in it.  I saved it to disk, opened it in Notepad, and copied and pasted some of the text below.  Any questions?


<sample text>

Overview
Country: United States
Industry: Communications

Customer Profile
Motorola has been in business since 1928. With headquarters near Chicago, Illinois, the company has become a global leader in wireless, automotive, and broadband communications.

Business Situation
Motorola needed a way to better manage its global infrastructure and gain greater control, reliability, and security, while reducing costs.

Solution
Motorola migrated from Windows NT® Server 4.0 to Microsoft® Windows Server SystemTM, consolidated its infrastructure, and implemented a change and configuration management solution.

Benefits
Estimated savings of U.S.$11 million in annual software deployment
More than 1 million software updates implemented in 2003
Microsoft Systems Management Server server count reduced by 50 percent
Domains reduced from 600 to 1
“Thanks to Windows Server System, we have the ability to do more with less. Windows Server System helps us to simplify deployment and management so we can reduce the costs of ongoing operations.”
Steven Bramson, Senior Systems Architect, Motorola


Elephant
Thursday, April 29, 2004

You got lucky.  Now edit that document 100 times and see if it still works.

5v3n
Thursday, April 29, 2004

I'm quite sure the document was edited a couple hundred times.  I doubt it was perfect at first creation.  Someone had to massage it into its current form, and everything seems fine.  In fact, scanning Word documents on my hard drive, I still haven't found a single case where I can't find the text in the file.

Elephant
Thursday, April 29, 2004

Microsoft Word stores text internally in a doubly-linked list of "runs". When you insert a lot of text in the middle, it creates a new run. In a native Word doc, runs are not necessarily written to disk consecutively... there's a feature called "fast save" in which Word just appends changed and new runs to the end of the disk file so as to save faster. Last I remember, Word does 10 fast saves for every 1 full save. Fast saves are dramatically faster than full saves on long documents.

Meaning... just using "strings" to pull strings out of the file doesn't work for arbitrary Word files, since the order of the strings on disk may not be the same as the order in the original document. In addition, there are likely to be runs in the disk file which have since been deleted from the document, often to much embarrassment and hilarity.
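A toy model of the problem (this is NOT Word's actual on-disk layout, just an illustration of runs written out of order): a full save writes the runs consecutively, a fast save appends an edited run to the end of the file, and a real parser follows the run table while "strings" just reads the bytes front to back.

```python
# File after a fast save: the original text, plus one appended run that
# logically replaces "brown " without overwriting it.
file_bytes = b"The quick brown fox jumps.reddish-brown "

# (offset, length) pairs in *document* order; the second entry points at
# the appended text, and the stale "brown " bytes are simply skipped.
run_table = [(0, 10), (26, 14), (16, 10)]

# Reassemble the document by following the run table, not disk order.
document = b"".join(file_bytes[off:off + length] for off, length in run_table)
```

Note that a strings-style scan of `file_bytes` would also happily return the dead "brown " run, which is exactly how deleted text leaks out of fast-saved files.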

http://support.microsoft.com/default.aspx?scid=http://support.microsoft.com:80/support/kb/articles/Q197/9/78.ASP&NoWebContent=1

Joel Spolsky
Fog Creek Software
Thursday, April 29, 2004

Dated, but the only thing I could ever find. The Perl and PHP OLE and Spreadsheet::Excel_Writer packages are based on it:

http://user.cs.tu-berlin.de/~schwartz/pmh/guide.html

josheli
Thursday, April 29, 2004

Would it not be possible to support these document types incrementally?

Just pick the low-hanging fruit, as it were. Even Google doesn't index ALL known document formats, they just do a select few. Why not pick $DOCUMENTS, i.e., all plain-text forms, PDF, CSV, and the MS Office formats (PowerPoint and Word, I guess), and push the product out?

Get some feedback, then think about whether the other document formats are necessary. Sometimes you might be able to silently convert from one format to a supported format (if all you want to do is return a list of documents that contain specific keywords, that conversion trick would certainly work).

I may be over-generalizing, but if you have 150 different document formats, perhaps a bit of consolidation might be better/easier than trying to write code that will index all of them?

deja vu
Friday, April 30, 2004

Well, is it possible to identify the MS Word .doc format? That would seem to be the key. If you knew the format, you could work with .doc files independent of MS Word.

-leon

Leon Spencer
Monday, May 3, 2004

The Word file format is proprietary and kept secret. You only get to look at it if Microsoft judges you worthy and you sign a nondisclosure agreement.

Microsoft's public text interchange format is RTF, and nowadays also WordML (not sure about legal restrictions on the latter, though).

By the way, you wouldn't really want to work with the native Word file format anyway, because it's a huge binary mess that's tightly coupled to Word's internal structure -- basically a memory dump!

Chris Pratley has recently blogged some very interesting stories about Word, including one about the file format:
http://weblogs.asp.net/chris_pratley/archive/2004/04/29/123619.aspx

Chris Nahr
Monday, May 3, 2004
