Fog Creek Software
Discussion Board




Stitching together MS Word Documents

So I want to do something impossible: stitch several MS Word documents into one big document using Java on a Linux server.

I found some Windows DLLs that claim to do this, but that's not going to help under Linux.

I could call some command line tools, but I can't find any that will do this. If OpenOffice were scriptable in this way, that would be great, but I don't think it is.

Any suggestions?

Fred0
Thursday, December 4, 2003

Let's see. OpenOffice IS scriptable. In fact, its APIs seem quite decent (I say this with no experience beyond reading the docs).

OpenOffice has several external SDKs, including one in Java. Apparently, the idea is to run the application binary as a server and talk to it through the SDK. That's according to one of my co-workers.

When I read through the documentation, a lot of it seems to be focused on creating plugins for OpenOffice, but it seems flexible in that regard. I'd be interested in hearing from someone with experience using it.

http://www.openoffice.org/dev_docs/source/sdk/
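As a rough sketch of the "run it as a server" idea, the Java side would first need to start OpenOffice listening on a socket before any SDK calls can reach it. The class and method names below are made up for illustration, port 8100 is an arbitrary choice, and the flag spelling is an assumption: older builds take single-dash options as shown, while newer ones expect `--headless` and `--accept=...`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the command line for an OpenOffice instance
// that accepts remote API connections on a TCP socket.
final class OfficeServerCommand {

    static List<String> build(String sofficePath, String host, int port) {
        List<String> cmd = new ArrayList<>();
        cmd.add(sofficePath);
        // Assumption: run without a GUI (flag spelling varies by version).
        cmd.add("-headless");
        // Listen on host:port using the "urp" remote protocol.
        cmd.add("-accept=socket,host=" + host + ",port=" + port + ";urp;");
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("soffice", "localhost", 8100);
        System.out.println(String.join(" ", cmd));
        // To actually launch it: new ProcessBuilder(cmd).inheritIO().start();
    }
}
```

Opening and stitching the documents would then go through the SDK's own classes, which this sketch deliberately leaves out.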

Also, Apache has HWPF, which is in early development but might be of some use. If you're using simple documents, it might be worth a go.

http://jakarta.apache.org/poi/hwpf/index.html

Alternatively, you could probably save the files in a different format and then redo the presentation layer with any formatting language.

Dustin Alexander
Thursday, December 4, 2003

If there isn't an easier way, you could ask Microsoft for the specification of the Word binary file format and then roll your own program to concatenate multiple files. I don't know how much supplication to Microsoft is required. <g>

http://support.microsoft.com/?kbid=290958

Robert Jacobson
Thursday, December 4, 2003

Do you want to type the code with both hands tied behind your back as well?

Linux
  ^
VMware
  ^
Windows
  ^
Word <- automation <- Java

Just me (Sir to you)
Thursday, December 4, 2003

Dead simple if you were running IIS and Office.

Just use VBA and Word Automation

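Roughly, as a Word VBA macro (the file paths are placeholders):

Sub StitchDocuments()
    Dim files As Variant, f As Variant
    files = Array("C:\docs\part1.doc", "C:\docs\part2.doc")  ' placeholder paths
    Documents.Add                                ' open new doc
    For Each f In files
        Selection.InsertFile FileName:=f         ' insert from file
    Next f
    ActiveDocument.SaveAs FileName:="C:\docs\combined.doc"  ' save doc
End Sub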


DJ
Thursday, December 4, 2003

Are you saying my hands are tied behind my back because I'm not using Microsoft?

If I really wanted to, I could have a separate Windows machine or VMware system that my application servers talk to in order to get this done. But only if there is no better way.

I'm looking into the StarOffice API a little more. I also like the idea of converting to another format, doing the stitching, then converting back. My software normally deals with PDFs, which are dead simple to work with in Java. If I can find something that compiles under Linux or runs in Java to easily convert back and forth, that might work.

Fred0
Thursday, December 4, 2003

Well, DJ is right that it would be stupid-simple if you could use Office automation. (Microsoft Word exposes a very rich object model that you can drive through COM Automation.) Here are some articles that discuss how to access this from Java:

http://www.land-of-kain.de/jacob/

http://www.java400.de/default.html?Javactpe.htm

http://j-integra.intrinsyc.com/j-integra/doc/other_examples/Word_from_Java.htm

Robert Jacobson
Thursday, December 4, 2003

By "Word document" do you mean .DOC? .RTF? HTML generated by Word? XML from Word 2003? The latter two could be far easier than the former...
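To illustrate why the HTML case is easier, here is a minimal sketch that pulls the body out of each file and concatenates the pieces in plain Java. `HtmlStitcher` and its methods are invented names. It assumes each input has a single body element, uses a regex rather than a real HTML parser, does not escape the title, and throws away everything in the head, including the style definitions that Word-generated HTML leans on heavily.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: naive stitching of HTML documents by body content.
final class HtmlStitcher {

    // Captures everything between <body ...> and </body>,
    // ignoring case and spanning line breaks.
    private static final Pattern BODY = Pattern.compile(
            "<body[^>]*>(.*?)</body>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the body content, or the whole input if no body tag is found.
    static String bodyOf(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1) : html;
    }

    // Wraps the concatenated bodies in one new, minimal document.
    static String stitch(String title, List<String> htmlDocs) {
        StringBuilder out = new StringBuilder();
        out.append("<html><head><title>").append(title)
           .append("</title></head><body>\n");
        for (String doc : htmlDocs) {
            out.append(bodyOf(doc).trim()).append("\n");
        }
        out.append("</body></html>\n");
        return out.toString();
    }
}
```

Nothing comparable works on binary .DOC files, which is why the answer to the format question matters.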

WildTiger
Thursday, December 4, 2003

Um, I would ask whoever created the MS Word documents to use Word to stitch them together into one big document. That process takes a click or two of the mouse.

Brad
Thursday, December 4, 2003

Without context this is hard, but I'm guessing the Word documents are provided by a number of third parties and it is the OP's job to aggregate them into an online delivery system.

An explanation of where this question is coming from might be useful here.

Dustin Alexander
Thursday, December 4, 2003

Also note you're not supposed to run Word on the server. Though many people do, it's neither designed for it (imagine some random dialog box popping up) nor supported.

So finding another utility, on any OS, is good if you can get that to work.

mb
Thursday, December 4, 2003

I don't know if you can use Word Merge, but you could generate your data in some sort of text file and then shuttle it off to a Win32 box that merges the data with the template. That is what we had to do in order to automate Word Mail Merge on our Unix boxen.
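The Unix half of that setup could look something like the sketch below, which writes a tab-delimited data file with a header row for the merge to pick up. `MergeDataWriter`, the field names, and the choice of tab-delimited UTF-8 are assumptions for illustration, not a description of the setup above. Copying the file to the Windows box and triggering the merge there are left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: produces a tab-delimited data source for a mail merge.
final class MergeDataWriter {

    // Tabs and line breaks inside a value would corrupt the row layout,
    // so collapse each run of them into a single space.
    static String clean(String value) {
        return value.replaceAll("[\\t\\r\\n]+", " ");
    }

    // First line is the header row; each later line is one record.
    static List<String> toLines(List<String> header, List<List<String>> rows) {
        List<String> lines = new ArrayList<>();
        lines.add(String.join("\t", header));
        for (List<String> row : rows) {
            List<String> cleaned = new ArrayList<>();
            for (String value : row) {
                cleaned.add(clean(value));
            }
            lines.add(String.join("\t", cleaned));
        }
        return lines;
    }

    // Writes the data file; the encoding is an assumption and may need
    // to match whatever the merging machine expects.
    static void write(Path target, List<String> header,
                      List<List<String>> rows) throws IOException {
        Files.write(target, toLines(header, rows), StandardCharsets.UTF_8);
    }
}
```

The header names would have to match the merge fields in the Word template.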

MR
Thursday, December 4, 2003

Good point mb.  Here's some information from Microsoft about running an Office application on a server.  The official party line is "you shouldn't do it, but here's how you can if you have to."

http://support.microsoft.com/default.aspx?scid=http://support.microsoft.com:80/support/kb/articles/Q257/7/57.asp&NoWebContent=1

Robert Jacobson
Thursday, December 4, 2003

If formatting isn't important, you'll probably find that there's a Word to text converter out there in the public domain.

In particular, take a look at Lucene and see if there's a filter for indexing Word documents. If there is, then you can probably adapt that code easily to do what you want.

Of course, if there are images, tables, OLE objects, etc. embedded in the Word documents and you care about them... well... good luck.

Burninator
Thursday, December 4, 2003
