Fog Creek Software
Discussion Board




Joel on Software

Find/Replace character in unicode do at byte level

How would you find/replace characters in a Unicode document at the byte level (in .NET, of course)?

That is, load the document as a byte stream and write the replaced output to a new file on the fly, while still reading the original file.
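For the plain-text case only, the streaming idea can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it is written in Python for brevity (the same loop maps onto a .NET `Stream` read/write loop), the name `replace_in_stream` and the `unit` parameter are made up for illustration, and it assumes the encoding is already known and that the stream starts on a code-unit boundary. It handles two things that a naive byte replace gets wrong: a match split across two reads, and (for UTF-16) a byte pattern that only appears to match because it straddles two neighbouring characters. It is not safe for Word/Excel/PowerPoint binary files.

```python
def replace_in_stream(src, dst, old_b, new_b, unit=1, chunk_size=4096):
    """Copy src to dst, replacing aligned occurrences of old_b with new_b.

    src and dst are binary file-like objects. unit is the code-unit size
    in bytes (hypothetical parameter: 1 for UTF-8, 2 for UTF-16).
    """
    if not old_b or len(old_b) % unit:
        raise ValueError("old_b must be non-empty and a multiple of unit")
    buf = b""  # unprocessed bytes; always starts on a code-unit boundary
    while True:
        chunk = src.read(chunk_size)
        at_eof = not chunk
        buf += chunk
        out = []
        pos = 0      # everything before pos has been emitted
        search = 0   # where the next search starts
        while True:
            i = buf.find(old_b, search)
            if i < 0:
                break
            if i % unit:
                # Misaligned hit: the bytes straddle two characters, skip it.
                search = i + 1
                continue
            out.append(buf[pos:i])
            out.append(new_b)
            pos = search = i + len(old_b)
        if at_eof:
            keep = len(buf)
        else:
            # Hold back a tail that could be the start of a match split
            # across reads, rounded down to a code-unit boundary.
            keep = max(pos, (len(buf) - len(old_b) + 1) // unit * unit)
        out.append(buf[pos:keep])
        dst.write(b"".join(out))
        buf = buf[keep:]
        if at_eof:
            break
```

Usage would look like `replace_in_stream(fin, fout, "a".encode("utf-16-le"), "b".encode("utf-16-le"), unit=2)` with both files opened in binary mode. For UTF-8, `unit=1` is enough, because a valid UTF-8 sequence cannot match in the middle of another character.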

Very important: I need to do this with Word/PowerPoint/Excel (or even text) documents, possibly using the same code to process each of these formats.

Also: would I need to skip some proprietary Office headers first if I don't want to corrupt the file? Which versions of Office save files in Unicode?

I've been searching the web about this but haven't come up with anything valuable yet.

Thanks for any hints or pointers!

hr.macintosh
Wednesday, November 10, 2004

Title should be:
"Find/replace character in unicode at byte level"
(where did that "do" come from ??)

hr.macintosh
Wednesday, November 10, 2004

That's going to be non-trivial, especially if you want to cater for Office files, Word, Excel, etc.

I would guess that Office files are streamed, i.e. lots of pseudo-files within the main file, so perhaps looking at the structured storage specifications would be good, although it'll probably be different for each version.
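As a rough illustration of why one byte-level routine can't treat all of these files alike, a first step might be to look at the file's leading bytes before touching anything. The sketch below (Python for brevity; `sniff_format` and its return labels are hypothetical names) checks for the 8-byte signature that structured storage (OLE2 compound) files start with, and for the common Unicode byte-order marks used by text files. It only classifies the file; it doesn't parse the streams inside a compound file.

```python
# Signature found at the start of OLE2 / structured storage compound files,
# the container used by the binary .doc, .xls and .ppt formats.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_format(head: bytes) -> str:
    """Classify a file from its first few bytes (labels are illustrative)."""
    if head.startswith(OLE2_MAGIC):
        return "ole2-compound-file"
    if head.startswith(b"\xef\xbb\xbf"):
        return "text-utf-8-bom"
    # Note: FF FE 00 00 would be a UTF-32LE BOM; this sketch ignores UTF-32.
    if head.startswith(b"\xff\xfe"):
        return "text-utf-16-le-bom"
    if head.startswith(b"\xfe\xff"):
        return "text-utf-16-be-bom"
    # No signature and no BOM: could be ANSI text, BOM-less UTF-8, or
    # something else entirely; the bytes alone don't say.
    return "unknown"
```

Something like `sniff_format(open(path, "rb").read(8))` would then decide whether a plain byte-level replace is even on the table. For the compound-file case the text lives inside internal streams with their own layout, so overwriting bytes in place is likely to corrupt the document.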

A different approach would be to open the files using the COM objects that Office exposes, and use them to manipulate the files. They have find/replace functionality built in. However, you seem to rule out that option in your original post.

Nemesis [µISV]
Wednesday, November 10, 2004

I feared that. I'd rather go for the Office interop, then.

Did I understand the readme correctly: Office doesn't need to be installed on the server, and the Office XP PIAs are sufficient?

hr.macintosh
Thursday, November 11, 2004
