Fog Creek Software
Discussion Board





Pasting from MS Word

I have been trying to find an elegant solution to this problem for a week now, but really, there is none. I can't ask my client to go through the steps of saving Word documents as HTML and then copying the source into the HTML view of an article.

I bought into the whole concept of a CityDesk-based solution so that my clients don't have to know HTML. They should be able to copy directly from Word into CityDesk.

You should have some type of warning on FogCreek's website. I really can't recommend CityDesk for companies that have the content of their websites come from corporate documentation.

I just found out that the Ektron CMS has three levels of MS Word cleanup and one of them is exactly what I need. Pasting from Word works seamlessly. Maybe FogCreek should do that for the next version of CityDesk.

Thompson Marzagao
Thursday, January 29, 2004

> You should have some type of warning on FogCreek's
> website.

There is one: it's called the free Starter Edition.

> I really can't recommend CityDesk for companies that
> have the content of their websites come from corporate
> documentation.

Reminds me of a banking group we visited a couple of days ago. Their (new) CMS is basically a Word-to-HTML converter with a wrapper app around it. All content is edited in and submitted as Word files. Let's see how that looks in a couple of months ;-)

> I just found out that the Ektron CMS has three levels
> of MS Word cleanup and one of them is exactly what I
> need.

Unfortunately, if I remember correctly, Ektron does not offer a license suitable for desktop apps like CityDesk (for their ActiveX components, that is, which you are most likely referring to). And there are not a whole lot of players left in that arena.

Patrick Thomas
http://www.telepark.de/webwizards

Patrick Thomas
Thursday, January 29, 2004

> I have been trying to find an elegant solution to this
> problem for a week now but really, there is none.

Oops, what actually *is* the problem?

Patrick Thomas
http://www.telepark.de/webwizards

PS: Should have asked that one first :-)

Patrick Thomas
Thursday, January 29, 2004

The problem is that MS Word puts a lot of Microsoft-specific 'clutter' in its files so that it can re-format them as Word files even if they are saved as .html files.
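
To make the 'clutter' concrete, here is a rough Python sketch of what a cleaner has to strip. It is an illustration only, not how any of the tools mentioned in this thread work: the function name `strip_word_clutter` is made up, the pattern list is incomplete, and regular expressions are a blunt instrument for HTML.

```python
import re

# Illustrative (incomplete) patterns for markup Word adds for round-tripping.
# Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
CONDITIONAL_COMMENT = re.compile(r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", re.S | re.I)
# Office-namespaced tags such as <o:p>, <v:shape>, <w:WordDocument>
OFFICE_TAG = re.compile(r"</?[ovw]:[^>]*>", re.I)
# Word-only style declarations such as mso-margin-top-alt:auto;
MSO_DECLARATION = re.compile(r"""\s*mso-[a-z-]+\s*:[^;"']*;?""", re.I)
# style attributes left empty once the mso- declarations are gone
EMPTY_STYLE = re.compile(r"""\s+style=(["'])\s*\1""", re.I)


def strip_word_clutter(html: str) -> str:
    """Remove the Word-specific markup matched by the patterns above."""
    html = CONDITIONAL_COMMENT.sub("", html)
    html = OFFICE_TAG.sub("", html)
    html = MSO_DECLARATION.sub("", html)
    html = EMPTY_STYLE.sub("", html)
    return html


sample = (
    '<p class="MsoNormal" style="mso-margin-top-alt:auto;">Hello<o:p></o:p></p>'
    "<!--[if gte mso 9]><xml>x</xml><![endif]-->"
)
print(strip_word_clutter(sample))
```

Note that this leaves class names like MsoNormal alone, so they can still be styled from a stylesheet.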

It is annoying.  The topic has come up here several times.

There are work-arounds.  One of them is:
http://textism.com/wordcleaner/

Another way that almost works is to use Paste without formatting.  The trouble with that is that it turns all text into plain paragraphs: it removes the <hx> tags that Word used.

Probably a "Tidy" solution could be built.

To include a Word Cleaner in CD would be a really nice feature.  Perhaps when Fogcreek improves the normal mode editor so that people can select text and apply <hx> tags, they might include this feature too.

joel goldstick
Thursday, January 29, 2004

I think some sites could "live with" pasting from MS Word. It pastes over just fine. If you accept that and are concerned about consistency, you'll have to exert some control over how users format in Word. You need to specify a style and make sure the authors use it.

I did about 200k words in one site. My author sent it in as Word documents in 10K chunks. I did a paste to Notepad and then to CityDesk for each one, and had to put a lot of formatting back in. It was tough but I like the way it turned out.

In hindsight, I could have saved a lot of trouble and preserved the author's "look" if I had just pasted it directly from Word.

tk
Thursday, January 29, 2004

HTML pasted from MS Word is unacceptable even after using third-party tools (or Microsoft's own add-on exporter) to strip the round-trip information.  The main problem that I see is that MS Word uses span tags to hard-code the font sizes of all text, which keeps the end user from adjusting the text size using browser controls.
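
As a sketch of what fixing just the font-size problem could look like, the Python below drops `font-size` declarations from inline style attributes so that the stylesheet and the browser's text-size control decide the size. The function name `drop_inline_font_sizes` is hypothetical and this is not what the third-party tools or Microsoft's exporter actually do; it only illustrates the idea.

```python
import re

# A style="..." or style='...' attribute, with its quote and body captured.
STYLE_ATTR = re.compile(r"""\s+style=(["'])(.*?)\1""", re.I | re.S)
# A font-size declaration; the lookbehind skips names such as mso-bidi-font-size.
FONT_SIZE = re.compile(r"(?<![\w-])font-size\s*:[^;]*;?\s*", re.I)


def drop_inline_font_sizes(html: str) -> str:
    """Remove hard-coded font sizes from inline styles, keeping other declarations."""

    def clean_style(match: re.Match) -> str:
        quote, body = match.group(1), match.group(2)
        body = FONT_SIZE.sub("", body).strip()
        # Drop the whole attribute if nothing is left in it.
        return f" style={quote}{body}{quote}" if body else ""

    return STYLE_ATTR.sub(clean_style, html)


before = '<span style="font-size:10.0pt;font-family:Arial">Syllabus</span>'
print(drop_inline_font_sizes(before))
```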

Paste without formatting is not acceptable for end users who have a lot of content in Word--Joel's target market, BTW.  For example:  I work in a college where instructors keep their syllabi in MS Word files.  Pasting the Word HTML into CityDesk creates HTML which is not up to the accessibility standards, so that is out.  Pasting without formatting requires adding back all the formatting, for each of the instructor's classes--and each semester the information changes.

The only palatable solution so far has been to turn CityDesk into a glorified database for PDF files:  Right click and choose save as to save the syllabus for this class.

David Burch
Thursday, January 29, 2004

> The problem is that MSWord puts lots of Microsoft
> specific 'clutter' in their files so that they can re-format

Sure, but is that really Thompson's problem? We have not heard from him, so that's speculation.

> It is annoying.

Well, maybe worse things could befall the nation. Does the clutter "hurt"?

> The main problem that I see is that MS Word uses
> span tags to hard-code the font sizes of all text,
> which keeps the end user from adjusting the text
> size using browser controls.

That's at least a specific example. But is that really a problem for Thompson, his customer and his target audience ("adjusting text size using browser controls")? If yes, how about pasting w/o formatting? Does it really take that long to reformat (does it offset the obvious usability and speed benefits CD delivers) in Thompson's case?

Reality is that even with Interwoven TeamSite (using the Ektron ActiveX component) lots of clutter finds its way into the CMS. That's the difference between (theoretical) business processes and (individual) business practices.

I sympathize with the moaning about this issue but I rather hoped to hear what the actual "practical" and specific problems of Thompson's are.

Patrick

Patrick Thomas
Friday, January 30, 2004

Hey, don't get me wrong, I really like CityDesk, it's not CityDesk's fault. It's Microsoft's. :)

But I really thought that I was not going to have any problems pasting from Word and I was surprised at how bad the results are.

I spent three weeks building a content management system leveraging CityDesk's features. It totally separates content from layout. All of my formatting is specified in a stylesheet.

When I paste directly from Word into an article, as my client is planning to do, I end up with messy HTML code that results in unpredictable editing when in the article's normal view. By unpredictable I mean that if I copy the text in the wrong way, some leftover bold tag might insist on making my text bold, and clicking on the bold button doesn't remove it. Word also inserts some {span style} tags that really make it impossible to change the formatting unless I go into HTML view. But my client is not HTML savvy.

I am going to define the MsoNormal and MsoBodyText classes in my stylesheet and that might help a bit, but that still doesn't solve the problem with the {span style} leftovers.

Thompson Marzagao
Friday, January 30, 2004

One solution might be to leave the document in Word and just drag the document into CityDesk, link to it, and have it open as a Word Document. The author could do all his editing in Word and the document would appear exactly as he produced it. The page that links to the document could be CSS controlled.

I think in any CMS, some editing and management will be required.

tk
Friday, January 30, 2004

Thompson Marzagao :re problems pasting from Word

As long as you never go into HTML view "and" then switch back into normal view, what you paste from Word works great when published and viewed via IE 5.x and above.

Microsoft have done a fabulous job integrating MS Word and MS Access [including all the other Office applications] -- CD uses the MS Access DB engine called "Jet". Now FC[CD] *html view* is proprietary to FC[CD] and does some "standards" stuff, which is where all the Word stuff “plus” gets whacked.

The issue is not that Word produces Word-specific tags that upset the "standards people" ... the issue IMO is how FC[CD] massages the MS tags to suit the so-called "open" community. :-)

All the stuff that I pasted into the CD “normal view” editor from MS Word publishes just great and I like it. However, I also have to work in HTML view, and when I do that I tend to use HTML Clean to remove the MS Word specific tags.

HTML Clean
http://songhaysystem.com/document.php?cmd=getDocCode&get=WD2KHTMLConversion

If you’re interested in HTML Clean, it integrates into MS Word via a macro. It's not the perfect solution but IMO it does a great job retaining most of the MS Word formatting in a “standards way” ;-)

David Mozer
Friday, January 30, 2004

tk : Re: Linking directly to the Word documents:

The problem is that the client specifically asked that some of his Word documents also have an HTML online version, so I can't just tell him now that pasting from his original Word documents won't work and that he has to link to them instead.

David Mozer : Re: HTML Clean

I tried the HTML Clean macro a few days ago. It does exactly what I need but it's way too slow to be useful. The Word documents I get are just too big and have too many tables to be processed by this macro in a reasonable amount of time.

Patrick Thomas and joel goldstick: Re: Pasting without formatting

It's not an option in this case. All documents contain tables and paste w/o formatting destroys the tables.

Anyway, thanks for the tips everyone.

Thompson Marzagao
Friday, January 30, 2004

Well, it looks like some compromise, extra work, and extra learning are required by the parties.

That's what I've had to do. I've always considered the Word document to be the prime source, the one the author has "on his computer." Once he has one version in Word and another in CityDesk, sync can be a problem. It is with me. Oh well, we've all got problems.

tk
Friday, January 30, 2004

I didn't dive into this at all but one idea enters my mind:
Isn't it possible to save as rtf and then have a stylesheet manage the rtf tags?

As I said, I haven't tried it and don't know if it is possible at all to let a stylesheet do anything with RTF tags...

Bert (NL)
Friday, January 30, 2004

A thought about those SPAN tags, etc.

Maybe using !important rules in CSS could nullify spans and other unwanted tag markup?  In the end, just copy the whole MS Word doc as is, and write CSS to ignore or alter it.

joel goldstick
Friday, January 30, 2004

joel goldstick : Re: !Important rules

That's a GREAT idea!

I am leaving the office right now, but I'll try it first thing Monday morning and I'll let you know how things turn out!

Thompson Marzagao
Friday, January 30, 2004

I set up a site using CityDesk for a person who doesn't know any HTML. She hates CityDesk and wants to dump it because she can't just paste into it from MS Word without going into CityDesk HTML view and doing things that she doesn't understand. I keep telling her that CityDesk is far better than anything else available, but she isn't convinced. If Version 3  makes it possible to just paste from Word into CityDesk, Fog Creek will really have a winner.

Does Word 2003 do a better job of converting to HTML than the older versions of Word? Also, I've been playing with saving as HTML with the MS Windows version of Open Office 1.1. It seems to convert to HTML a lot better than Word does. It uses CSS, and there isn't a SPAN to be seen (unless you import a Word document). It does a better job than Word XP, but it isn't perfect. I wonder if someone who knows a lot more about these things than I do has any experience with Word 2003, Open Office and CityDesk?

Dick Dillon
Saturday, January 31, 2004

Hello all,

This is just to let you know, using !important rules and overriding the MS Word styles helped. I did this:

.MsoNormal, .MsoBodyText, .MsoBodyText2, .MsoBodyText3, li, td, p, b, strong {
  font-family: Verdana, Arial, Helvetica, sans-serif !important;
  font-size: 10pt !important;
  line-height: 12pt !important;
}

But I am still having problems, because pasting directly from Word leaves some annoying {font size} and {font face} tags that still take precedence over the stylesheet.

If the font tags went away however, I think that everything would be fine. I might reluctantly try to do post-processing of the files using a text replacement utility like "ReplaceEm" (http://www.boolean.ca/replace/) to remove those font tags.
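
For illustration, the kind of replacement I have in mind could be sketched in Python like this. It is not ReplaceEm's syntax or behavior, just the same idea with a regular expression; the function name `strip_font_tags` is made up, and it assumes the font tags carry nothing worth keeping.

```python
import re

# Opening and closing font tags, with any attributes, in any letter case.
FONT_TAG = re.compile(r"</?font\b[^>]*>", re.I)


def strip_font_tags(html: str) -> str:
    """Remove <font ...> and </font> tags but keep the text inside them."""
    return FONT_TAG.sub("", html)


pasted = '<p class="MsoNormal"><FONT face="Arial" size=2>Hello <b>world</b></FONT></p>'
print(strip_font_tags(pasted))
```

Run over the published files, this would be an automated post-processing step rather than something the client has to remember to do.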

Thompson Marzagao
Monday, February 02, 2004

Did you run it through the Microsoft add-on html cleaner?

joel goldstick
Monday, February 02, 2004

joel goldstick : Word cleaner

I didn't run it through the cleaner because I don't expect my client to do the same.

This would add one extra step that I want to avoid if I can, as my client's staff will have to update the site themselves.

If I had to update the site myself I would run the documents through the cleaner, but that then defeats the purpose of building a CMS for them.

They don't care if it's a problem with Word or CityDesk. To them, it's my solution and therefore it's my fault if something doesn't work as intended.

I want to do the best I can so that my client doesn't have to worry about being able to paste text from wherever it is. To them it's such a basic operation that they don't see why it should cause any problems.

Thompson Marzagao
Monday, February 02, 2004

The MS Word Cleaner doesn't actually add an extra step: instead of "save as web page" they choose "export."

David Burch
Monday, February 09, 2004

The MS Word "Cleaner" I was talking about is Office 2000 HTML Filter 2.0: 
http://www.microsoft.com/downloads/details.aspx?FamilyID=209ADBEE-3FBD-482C-83B0-96FB79B74DED&displaylang=EN

Here is an interesting article that delves a little into stripping tags using VBA:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnofftalk/html/office120999.asp

David Burch
Monday, February 09, 2004

I have found that if you paste your document into WordPad first (or open the Word doc in WordPad instead of Word), then pasting from there into a WYSIWYG HTML editor produces a relatively clean HTML document.

For example, I made several versions of the same page (using Microsoft CMS!!!) and these are the results (in number of characters):

Pasting direct from Word 2000: 37194 chars
Pasting from WordPad: ...are you sitting down... 6476 chars

This is a huge improvement, and even better than using the Microsoft tool, which produced an improved but still abysmal 17640 chars.
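
To put those numbers in proportion, here is the arithmetic as a small Python sketch. The character counts are the ones quoted above; the helper name `reduction_percent` is just for illustration.

```python
def reduction_percent(before: int, after: int) -> float:
    """Percentage by which the character count shrank."""
    return 100 * (before - after) / before


direct_from_word = 37194  # pasted straight from Word 2000
via_wordpad = 6476        # pasted via WordPad
via_ms_tool = 17640       # after the Microsoft tool

# WordPad route: roughly 83% smaller than the direct paste.
print(f"WordPad: {reduction_percent(direct_from_word, via_wordpad):.1f}% smaller")
# Microsoft tool: roughly 53% smaller than the direct paste.
print(f"MS tool: {reduction_percent(direct_from_word, via_ms_tool):.1f}% smaller")
```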

If you find a good aspx tool I would love to get it.

Mike Lee
Tuesday, March 23, 2004

Try the Microsoft 2000 HTML Mess Cleaner at http://www.algotech.dk/word-html-cleaner-input.htm . It's a free online tool that strips unnecessary code from Word-generated HTML.

It's ASP-based (classic, not .NET), and the source code is available for developers to buy. It could easily be integrated into a content management system.

Morten Nilsson
Monday, August 30, 2004
