Word processor with CONSERVATIVE 'save as HTML'?
Ok so here's the deal. I'm writing a rather long manuscript in Word, and Word is a wonderful thing. Truly, truly it is. I love Word, I want to have Word's children. However:
Here's one thought: Work some magic with regular expressions to strip out all the tags except the few you want to keep...
I've thought about that, and sadly, it seems that's what I'll end up doing. I understand (and greatly enjoy) regular expressions, but I'm no expert yet. I would just prefer that creating The Perfect Regexp to do exactly what I want not become the focus of my project. You're right though, it's probably the solution.
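For what it's worth, here is a rough sketch of the regex route in Python (my choice of language, nothing from the thread). The KEEP whitelist and the function name are made up for illustration. It removes every tag whose name is not on the whitelist and leaves the text alone. Caveats: kept tags retain whatever attributes Word put on them, comments and doctype declarations are not touched, and the CSS inside a <style> block survives as plain text.

```python
import re

# Hypothetical whitelist; change it to the tags you actually want to keep.
KEEP = ("b", "i", "em", "strong", "p")

# Matches any opening or closing tag whose name is NOT in KEEP.
# The \b stops "b" from also protecting <body>, "i" from protecting <img>, etc.
TAG_RE = re.compile(
    r"</?(?!(?:%s)\b)[a-zA-Z][^>]*>" % "|".join(KEEP),
    re.IGNORECASE,
)

def strip_unwanted_tags(html):
    """Delete non-whitelisted tags, keeping the text between them."""
    return TAG_RE.sub("", html)
```

Something like strip_unwanted_tags(open("input.html").read()) would be the whole job, which is also why it only goes so far: it never looks at what is between the tags.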
Possibly open the document with OpenOffice, then try a save as HTML. I've found OpenOffice's HTML to be an order of magnitude better than MS Office's. Might cut down on your regex work.
Actually... that may be exactly it. Maybe I can just feed the whole thing through PHP's strip_tags() function and call it a day. Though it would still be nice not to have to clean up after Word. :)
Another good suggestion, Mike. I actually have OpenOffice installed and just didn't think to try it.
I guess you must mean you want to keep the bold and italic formatting itself, not just your literal [B]old[/B] and [I]talicized[/I] tags?
I thought it was the best way to ask actually... it was a confusing concept.
Correct. Ideally, in the editor I should see bold or italicized text, and the output should be bare-bones HTML (not FrontPage-esque three-page-long <head> sections like Word gives).
muppet, I think you mentioned in another thread that you're comfortable with Perl... you might want to check out HTML::Parser. It'll build the parse tree and I think all you'll have to do is traverse it, re-emitting the tags you want to keep and all the text and tossing everything else. Should be pretty quick (and if you haven't had the pleasure, parsing HTML with regexes can be much trickier than it first looks)...
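A minimal sketch of that parse-and-re-emit idea, using Python's standard html.parser in place of Perl's HTML::Parser (a swap on my part; it is event-driven, so there is no tree to walk, just callbacks). The KEEP and SKIP_CONTENT sets and the names TagFilter and filter_html are hypothetical. Whitelisted tags are re-emitted without their attributes, text and entities pass through, and everything else is dropped, including the contents of <style>, <script>, and <title>.

```python
from html.parser import HTMLParser

KEEP = {"b", "i", "em", "strong", "p"}       # hypothetical whitelist
SKIP_CONTENT = {"style", "script", "title"}  # drop these along with their text

class TagFilter(HTMLParser):
    def __init__(self):
        # Keep entities as written instead of decoding them to characters.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a SKIP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in KEEP and not self.skip_depth:
            self.out.append("<%s>" % tag)  # attributes deliberately dropped

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP and not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append("&#%s;" % name)

def filter_html(html):
    """Return html with only whitelisted tags, text, and entities kept."""
    parser = TagFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Comments are dropped for free because the default comment handler does nothing, which should take care of most of Word's conditional-comment blocks.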
Use HTML tidy with the word-2000 option.
Yup. tidy.exe -i -wrap 132 -u -c -latin1 < input.html > output.html
You could also save the file as an RTF file. There are several utilities for converting RTF to HTML or XML. They should let you pick and choose which tags you want to keep and which to discard.
It might not be in the current collection, but one of the previously available XP PowerToys was a utility to strip away all Word-related content in HTML. I'm sure that if you google it you'll find somewhere to download it. I used it a number of times and it worked perfectly.
Dreamweaver has an option to fix Word HTML. Find it at macromedia.com.
What version of Word do you have? I have 2003, and it has the option to save as "Web page, Filtered", which saves pretty minimal HTML. I think this feature was introduced in the XP (2002) version, but I'm not sure.
Gary van der Merwe
Luke Francl has already written a tidy-up script that fixes your specific problem.
Actually, something like CityDesk is perfect for this, or Open Office as that has less of a Sturm und Drang approach to CSS.
Just creating a web page with the DHTML Edit control could do it, but he's already professed his undying love for MS Word and his need to use the .doc versions. I don't know if he needs to keep it in .doc format the whole time; maybe he could just maintain it in HTML and import it into Word.
Textism provides a Word HTML cleaner. If you're so inclined, you can subscribe, but I think it's free for one-off/personal use...
Tom (a programmer)
Ah. Seems to have a 20Kb limit for non-subscribers.
Tom (a programmer)
KOffice's KWord outputs some pretty nice HTML, and has three levels of fancification. Not on Windows, though.
I have the ultimate solution to your problem but I have decided not to share.
muppet, what is this large manuscript? Is it the muppet manifesto?
I'm sure he's preaching to the choir here, but I just want to add a "hear hear" to Jim's sermon.