Fog Creek Software
Discussion Board




MS Word .doc word counting

Hello everyone,

I'd like to do (Java, if possible) server-side word counting on the MS Word .doc files that my users upload. I have no idea how to do this. Can someone kindly provide some tips? Many thanks in advance!

Rex Guo
Monday, May 06, 2002

MS Word stores the result of the last Word Count run on the document in a field of the document header itself. If that is what you want, you can pick it up from that field. Otherwise you will have to parse the file and do the counting yourself :-) Not an easy job, given that the format is binary and not really in the public domain. I don't think the real word count is stored in the document.
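For the "do the counting yourself" route, the counting step itself is small once the text is out of the file. Here is a minimal Java sketch, assuming the plain text has already been extracted from the .doc (the hard part, not shown here). The class name PlainTextWordCount and the whitespace-splitting rule are assumptions for illustration, and the result may differ from what Word's own Word Count reports.

```java
// Hypothetical helper: counts words in text that has ALREADY been
// extracted from a .doc file. Extraction itself is not covered here.
class PlainTextWordCount {

    // Counts runs of non-whitespace characters. This is an assumed
    // definition of "word" and may not match Word's own count.
    static int countWords(String text) {
        if (text == null) {
            return 0;
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        // Split on one or more whitespace characters (spaces, tabs, newlines).
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        String sample = "The quick  brown fox.\nJumps over the lazy dog.";
        System.out.println(countWords(sample));
    }
}
```

Punctuation attached to a word is counted as part of that word, so "fox." is one word under this rule.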

shailesh kumar
Monday, May 06, 2002

Shailesh,

Thanks for the info!
What did you mean when you said the real word count is not stored in the document? Is there a fake word count?

Rex Guo
Monday, May 06, 2002

Uh oh! It means that the word count is not updated every time the document is edited. It's just the result of the last Word Count execution by the user through the UI.
This info is based on the documentation provided by MS for the Word 97 file format. I haven't verified it through reverse engineering, though.

shailesh kumar
Monday, May 06, 2002

You can use Word's COM interface to get the word count of a document from most languages. Here's an example in VBScript that displays the word count of c:\test.doc in a message box:

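' Start a hidden instance of Word through COM
Dim word
Set word = CreateObject("Word.Application")
word.Visible = False
word.Documents.Open "c:\test.doc"
' Note: Words.Count also counts punctuation and paragraph marks
MsgBox word.Documents(1).Words.Count
' Quit without saving anything (0 = wdDoNotSaveChanges)
word.Quit 0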

Don't know if Java can access COM though.
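One way to sidestep that question is to have Java launch the script and read the count from its output, rather than talking to COM directly. The sketch below is only that: it assumes a hypothetical script, wordcount.vbs, that takes the document path as its first argument and prints the count with WScript.Echo instead of MsgBox, and it assumes a Windows machine with Word installed. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical wrapper: runs a VBScript through Windows Script Host
// (cscript) and parses the number it prints. This does not call COM
// from Java; the script does the COM work.
class ScriptWordCount {

    // Builds the command line. //NoLogo suppresses the cscript banner
    // so that only the script's own output is printed.
    static List<String> buildCommand(String scriptPath, String docPath) {
        return List.of("cscript", "//NoLogo", scriptPath, docPath);
    }

    // Parses the script output, assumed to be a single integer.
    static int parseCount(String output) {
        return Integer.parseInt(output.trim());
    }

    // Runs the script and returns the reported word count.
    static int countWords(String scriptPath, String docPath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(scriptPath, docPath));
        pb.redirectErrorStream(true); // fold stderr into stdout
        Process process = pb.start();
        String output;
        try (InputStream in = process.getInputStream()) {
            output = new String(in.readAllBytes(), StandardCharsets.US_ASCII);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("script failed (exit " + exitCode + "): " + output);
        }
        return parseCount(output);
    }
}
```

A call would look like ScriptWordCount.countWords("wordcount.vbs", "c:\\test.doc"). Each call starts a fresh copy of Word, so this is slow and probably only suitable for low upload volumes.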

Matthew Lock
Monday, May 06, 2002

There are a number of packages that let you access COM from Java. Here's one from Microsoft's site: http://www.microsoft.com/java/resource/java_com2.htm

I haven't really used any of them, just know they exist.

Matt Christensen
Monday, May 06, 2002

There also exists an Apache project named POI.

http://jakarta.apache.org/poi/

Sebastian
Monday, May 06, 2002

I was recently looking into the feasibility of rolling a VB app that used Word's COM interface into a DHTML app. But our webmaster didn't think it would work without installing Word on the IIS server, which he didn't want to do.

I had other things to work on, so I never checked it out. Anyone know: can I simply register the Word COM interface on the server without installing the entire app?

Just Curious
Monday, May 06, 2002

Big thanks to everyone! Now I at least have a clue!

Rex Guo
Monday, May 06, 2002

More packages listed at http://www.geocities.com/marcoschmidt.geo/java-libraries-word.html

Marco Schmidt
Friday, June 06, 2003
