Fog Creek Software
Discussion Board

How are file formats reverse engineered?

I am curious to know how proprietary file formats are reverse engineered. For example, WordPerfect seems to open a Microsoft Word document quite nicely, although the .doc format is proprietary and Microsoft does not reveal it. Ditto for Microsoft Word opening WordPerfect files. Adobe Acrobat is another example: though PDF is proprietary, many other companies have cropped up that can convert, for instance, Word documents to PDF.

Can anyone give a layman's explanation of how it is actually done? I am an application developer working on databases and as such have little knowledge of these things.

Wednesday, July 7, 2004

Create a blank file in the source application. Save it. Look at it in a Hex editor.

Create a new file in the source application. Type a single 'A'. Save it. Look at it in a hex editor.

Make the 'A' boldface. Save.

etc., ad nauseam. It's like solving a puzzle; it's never really hard, since you can generate all the test data you need, but it is very exacting and time-consuming.
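The compare step can be automated instead of eyeballing two hex dumps. A minimal sketch (the file names are hypothetical; any two saved variants of a document will do):

```python
import itertools

def diff_bytes(a: bytes, b: bytes):
    """Return (offset, byte_in_a, byte_in_b) for every position that differs."""
    return [(i, x, y)
            for i, (x, y) in enumerate(itertools.zip_longest(a, b))
            if x != y]

# In practice you'd read the two saved files, e.g.:
#   with open("blank.doc", "rb") as f: a = f.read()
#   with open("single_a.doc", "rb") as f: b = f.read()
# Toy data here just to show the output shape:
a = b"\x00HDR\x00"
b = b"\x00HDR\x01A"
for offset, old, new in diff_bytes(a, b):
    print(f"offset 0x{offset:04x}: {old!r} -> {new!r}")
```

Diffing the "blank document" file against the "single 'A'" file pinpoints exactly which bytes the new character touched, which is much faster than scanning the dumps by hand.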

Chris Tavares
Wednesday, July 7, 2004

I would imagine most of the text in a Word doc is stored in standard RTF format, and it's probably precisely those parts that aren't that are most likely to break in a third-party app: complex formulas, embedding, tables, etc.

Doesn't OpenOffice work with MS Office documents? Ask them.

Alpha Release Male
Wednesday, July 7, 2004

There's some good info on this in the Jakarta POI project, an open source effort to provide a Java API for Microsoft file formats.

The native Microsoft Office file format is incredibly complex, actually.  Some of it has been documented by Microsoft; some of it was figured out the hard way.

Wednesday, July 7, 2004

How long do you think it will be until someone uses the DMCA as protection against having their formats reverse engineered?

Wednesday, July 7, 2004

How long until someone reverse engineers the DMCA?

I wonder if, then, "MS Word" can be considered the author of a file it spits out, and therefore have legal rights over it.

(I'm being facetious.)
Wednesday, July 7, 2004

Regarding the original post: While the PDF format is proprietary, its full documentation is available as a free download at the Adobe website.

Microsoft, however, is probably too ashamed to document the native MS Word format. :)

Chris Nahr
Wednesday, July 7, 2004

They (MS) do have documentation freely available for the XML versions of their Office document formats.

This is useful, as you can create Office documents on a web server using just XML, i.e. you don't need to install Office on the server. It even works fine with non-MS web servers.
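For what it's worth, a minimal Word 2003 XML document looks roughly like this (a sketch from memory, not the full schema; the `mso-application` processing instruction is what tells Windows to open the file in Word):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <w:r>
        <w:t>Hello from a plain XML file</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>
```

Any code that can write text can emit this, which is the whole point: no Office installation needed on the server.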

Steve Jones (UK)
Wednesday, July 7, 2004

But then that requires the people downloading said Office file to be using Office 2003, doesn't it? Which has very little reach at the moment.

James 'Smiler' Farrer
Wednesday, July 7, 2004

Uh... since it's XML, you don't need any MS-specific anything.  As long as you have the schema, it is quite simple to create new documents.

And the issues with Office 2003 are real.  I refuse to install it because it hijacks the viewing of XML documents in the browser, too.  I want to see the raw XML, not their stuff.

Wednesday, July 7, 2004

KC, you can do that just as well with binary .doc files; I can give you the schema. It's a set of bytes. It also has an XML-equivalent schema; the docs look like this: <xml><hexdata>ddef2817efdd....</hexdata></xml>

Schema is syntax, which is necessary but entirely insufficient. To get anything done, you need the semantics - and you need those from MS.

Ori Berger
Wednesday, July 7, 2004

SDKs are available for Flash and Acrobat by Macromedia and Adobe, respectively.

Green Pajamas
Wednesday, July 7, 2004

The WordPerfect format is open, documented, and downloadable in the SDK.

Wednesday, July 7, 2004

Microsoft does make available the details of the file format for an older version of Word.  All the open source software that reads and writes Word documents started with that documentation.

Almost Anonymous
Wednesday, July 7, 2004

All of the messages here so far seem to indicate the obvious way to go about reverse-engineering file formats when you can create examples at will and according to whatever criteria you desire.

I've had to do a harder task twice:  given only one, or at most five, files that contain several megabytes of data of different kinds, figure out how it's stored.  I did start out knowing the purpose and basic form of the stored data (heck, that's why I knew I wanted to go through the pain).

First, observe that it's common (tantamount to necessary, in fact) to store data using a tagging scheme and some sort of index region or table of contents.  This could be a fixed layout in a header block in a binary file, or a text-indexed table of contents with offsets, but it's usually there in files over a megabyte with heterogeneous contents.

Second, write a simple program to walk the file one byte at a time and output the four-byte quantities interpreted in both byte orders, one position per line.  So line 1 is the 32-bit int you get by interpreting the first four bytes of the file big-endian, then little-endian.  Line 2 is the two numbers you get for the second through fifth bytes, and so on.

Assuming this is small enough, load it into Excel (you wrote it tab-delimited or CSV, right?) and compute a third and fourth column telling whether each of the first two numbers is a valid offset into the file.

Rapidly page down through your spreadsheet and look for a tight cluster of regularly spaced proper file offsets.  Much of the time you just found an indexing region.

If you hex dump and ascii dump on each line as well, you can often see if there's naming text around the offsets.

If the file is too big to dump this way, just write some code to output only the offsets in the file where valid offsets into the file are stored.

I couldn't generate my own data because it was produced by a complex physical apparatus to which I had no access... just a few files found by happenstance.

From the above, there are of course lots of intuitive minor next moves.  But in general, leveraging the need for internal indexing is pretty powerful when you can't make any example you want...

Thomas E. Kammeyer
Wednesday, July 7, 2004

The PDF file spec, while available, doesn't make it that much easier to reliably read and make use of the data in a PDF file.  Part of that is the incompleteness of the spec -- a lot of the little niggling details are implied instead of made explicit.  That makes the job of building a PDF parser that does something useful very tedious and exacting, as a poster wrote upthread.


Chas Emerick
Snowtide Informatics: PDFTextStream, a high-performance PDF text extraction Java library
Monday, August 23, 2004
