Fog Creek Software
Discussion Board




word processor with CONSERVATIVE 'save as HTML'?

Ok so here's the deal.  I'm writing a rather long manuscript in Word, and Word is a wonderful thing.  Truly, truly it is.  I love Word, I want to have Word's children.  However:

For distribution, I not only have to send the .doc file to some places, but I also need to be able to upload the content as HTML to be include( ) 'ed in the content of a web page.  Now, the extra formatting crap that Word throws into every document you save as HTML has become problematic.  There are tons of meta tags, CSS blocks, superfluous <span> tags, etc.

I need (though this is probably too specialized and I'll end up writing my own) a word processing app that will save both as a doc file and as more "conservative" html.  That is, I want bold and italics (and really little else) to survive HTML-ification, but I don't need a huge CSS block, span's, meta's, etc.  An editor that would allow me to specify what formatting to preserve and what not would be EXCELLENT, but I fear that there may be no such beast.

Anybody?

muppet
Thursday, August 12, 2004

Here's one thought: Work some magic with regular expressions to strip out all the tags except the few you want to keep...
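
A minimal sketch of that regex idea, in Python rather than anything the thread settles on. The whitelist and the name strip_tags_except are made up for illustration. It only understands simple tags, so the text inside Word's <style> block and any HTML comments would pass through untouched.

```python
import re

# Tags to keep; everything else is dropped. This whitelist is an assumption.
KEEP = {"b", "i"}

# Matches one opening or closing tag and captures its name (e.g. "p", "o:p").
TAG_RE = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9:]*)[^>]*>")

def strip_tags_except(html, keep=KEEP):
    """Drop every tag not in `keep`; re-emit kept tags without attributes."""
    def replace(match):
        name = match.group(1).lower()
        if name not in keep:
            return ""
        closing = match.group(0).startswith("</")
        return f"</{name}>" if closing else f"<{name}>"
    return TAG_RE.sub(replace, html)

if __name__ == "__main__":
    sample = '<p class=MsoNormal>This is <i>important</i> and so is <b style="x">this</b>.</p>'
    print(strip_tags_except(sample))
```

Attributes on kept tags are thrown away on purpose, since Word tends to attach style attributes even to <b> and <i>.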

Kyralessa
Thursday, August 12, 2004

I've thought about that, and sadly, it seems that's what I'll end up doing.  I understand (and greatly enjoy) regular expressions, but I'm no expert yet.  I would just prefer that creating The Perfect Regexp to do exactly what I want not become the focus of my project.  You're right though, it's probably the solution.

muppet
Thursday, August 12, 2004

Possibly open the document with open office then try a save as html.  I've found open office html to be an order of magnitude better than ms office.  Might cut down on your regex work.

Mike
Thursday, August 12, 2004

Actually.. that may be exactly it.  Maybe I can just feed the whole thing through PHP's strip_tags( ) function and call it a day.  Though, it would still be nice not to have to clean up after Word.  :)

Thanks.

muppet
Thursday, August 12, 2004

another good suggestion, Mike.  I actually have OpenOffice installed and just didn't think to try it.

muppet
Thursday, August 12, 2004

Try AbiWord.


Thursday, August 12, 2004

I guess you must mean you want to keep the bolded and italicized tags and such themselves, not just keep your [B]old[/B] and [I]talicized[/I]  tags themselves?

duly confused
Thursday, August 12, 2004

duly -

I have absolutely no clue of what you're trying to ask me.  Please rephrase in Standard English.  Thanks.

muppet
Thursday, August 12, 2004

I thought it was the best way to ask actually... it was a confusing concept.

You want to keep the behavior of you opening the document and seeing portions of the HTML code in bolded and italicized text right? Not just you want it to keep [B] and [I] tags and strip other tags out?

duly confused
Thursday, August 12, 2004

Correct.  Ideally in the editor I should see bold or italicized text, and in the output should be bare-bones html.  (not Frontpage-esque 3-page long <head> sections like Word gives)

muppet
Thursday, August 12, 2004

muppet, I think you mentioned in another thread that you're comfortable with Perl... you might want to check out HTML::Parser. It'll build the parse tree and I think all you'll have to do is traverse it, re-emitting the tags you want to keep and all the text and tossing everything else. Should be pretty quick (and if you haven't had the pleasure, parsing HTML with regexes can be much trickier than it first looks)...
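
A rough sketch of the same keep-what-you-want approach, using Python's standard html.parser module as a stand-in for Perl's HTML::Parser. Python's parser is event-driven, so there is no tree to walk; the handlers just re-emit what should survive. The class name KeepOnly, the helper clean, and both tag lists are assumptions for illustration.

```python
from html.parser import HTMLParser

class KeepOnly(HTMLParser):
    """Re-emit text plus a whitelist of tags; drop everything else."""

    def __init__(self, keep=("b", "i"), skip=("head", "style", "script")):
        # convert_charrefs=False so entities like &amp; are passed through as-is.
        super().__init__(convert_charrefs=False)
        self.keep = set(keep)    # tags re-emitted (without attributes)
        self.skip = set(skip)    # elements whose whole content is dropped
        self.skip_depth = 0
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip:
            self.skip_depth += 1
        elif tag in self.keep and not self.skip_depth:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in self.skip:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.keep and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")

def clean(html):
    parser = KeepOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Comments are dropped because the default comment handler does nothing, and everything inside <head> (including Word's CSS block) is skipped wholesale.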

John C.
Thursday, August 12, 2004

John-

Thanks for your input.  And yeah, the bulk of my introduction to programming was in Perl doing web development, so I'm intimately familiar with how much fun it is to parse HTML with Regexps.  That's why I don't want to do it.  :)

muppet
Thursday, August 12, 2004

Use HTML tidy with the word-2000 option.

Rhys Keepence
Thursday, August 12, 2004

Yup. tidy.exe -i -wrap 132 -u -c -latin1 < input.html > output.html

http://tidy.sourceforge.net/

Fred
Thursday, August 12, 2004

You could also save the file as an RTF file.  There are several utilities for converting RTF to HTML or XML.  They should let you pick and choose which tags you want to keep and which to discard.

http://www.rtf-to-xml.com/index.html
http://www.logictran.net/products/r2net.html
http://rtf2fo.com/

Robert Jacobson
Thursday, August 12, 2004

It might not be in the current collection, but one of the previously available XP PowerToys was a utility to strip away all Word-related content in HTML. I'm sure that if you google it you'll find somewhere to download it. I used it a number of times and it worked perfectly.

  --Josh

JWA
Thursday, August 12, 2004

Dreamweaver has an option to fix Word HTML. Find it at macromedia.com

JD

JD
Friday, August 13, 2004

What version of Word do you have? I have 2003 and it has the option to save as "Web page, Filtered", which saves pretty minimal HTML. I think this feature was introduced in the XP (2002) version, but I'm not sure.

The following was saved from Word using the "Web page, Filtered" option.

<html>

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 11 (filtered)">
<title>Garys Home Page</title>
<style>
<!--
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin:0in;
    margin-bottom:.0001pt;
    font-size:12.0pt;
    font-family:"Times New Roman";}
h1
    {margin-top:12.0pt;
    margin-right:0in;
    margin-bottom:3.0pt;
    margin-left:0in;
    page-break-after:avoid;
    font-size:16.0pt;
    font-family:Arial;}
@page Section1
    {size:595.3pt 841.9pt;
    margin:1.0in 1.25in 1.0in 1.25in;}
div.Section1
    {page:Section1;}
-->
</style>

</head>

<body lang=EN-GB>

<div class=Section1>

<h1>Heading</h1>

<p class=MsoNormal>This is <i>important</i> and so is <b>this</b>.</p>

</div>

</body>

</html>

Gary van der Merwe
Friday, August 13, 2004

Luke Francl has already written a tidy-up script that fixes your specific problem.

http://luke.francl.org/software/word-unmunger/

Written in everyone's favorite language...Python!

lumberjack
Friday, August 13, 2004

Actually, something like CityDesk is perfect for this, or Open Office as that has less of a Sturm und Drang approach to CSS.

Simon Lucy
Friday, August 13, 2004

Just creating a web page with the DHTML Edit control could do it, but he's already professed his undying love for MS Word and his need to use the .doc versions. I don't know if he needs to keep it in .doc format the whole time; maybe he could just maintain it in HTML and import it into Word.

www.MarkTAW.com
Friday, August 13, 2004

Textism provides a Word HTML cleaner.  If you're so inclined, you can subscribe, but I think it's free for one-off/personal use...

http://textism.com/wordcleaner/

Tom (a programmer)
Friday, August 13, 2004

Ah.  Seems to have a 20Kb limit for non-subscribers.

http://textism.com/wordcleaner/?subscribe=1

Tom (a programmer)
Friday, August 13, 2004

KOffice's KWord outputs some pretty nice HTML, and has three levels of fancification. Not on Windows, though.

Thom Lawrence
Friday, August 13, 2004

I have the ultimate solution to your problem but I have decided not to share.

You may want to look up Karma in your Peanuts Dictionary.

Dutch Boyd
Friday, August 13, 2004

muppet, what is this large manuscript? Is it the muppet manifesto?

moo
Friday, August 13, 2004

<Sermon>

This is why content-oriented document formats with styles applied later are superior to the appearance-oriented file formats used by most word processors.  Edit based on meaning; style and format appearance separately.

I'm curious to see if this scheme gains popularity as more and more people become exposed to the CSS/HTML/XML concepts.  Repurposing content is probably more important today than ever.

Some XML editors now provide real time WYSIWYG editing for XML files using CSS or XSLT to translate between the actual document structure and what the rendered document will look like.  It will be interesting to see if this idea catches on.

</Sermon>

Jim Rankin
Friday, August 13, 2004

I'm sure he's preaching to the choir here, but I just want to add a "hear hear" to Jim's sermon.

How effective is LyX in moving toward that end?  What about the scuttlebutt that MS Office went to XML as the underlying file format for Office XP(?) ?  Did that help at all?

OffMyMeds
Friday, August 13, 2004

<Silly>

On a totally different note, did anyone think that muppet wanted his HTML to look like Rush Limbaugh's website or FreeRepublic.com after reading the thread title?

</Silly>

Jim Rankin
Friday, August 13, 2004
