Fog Creek Software
Discussion Board

CityDesk comment in JOS about "cleaner" HTML

Joel comments on CityDesk:
"In the next release of CityDesk we're doing really heroic amounts of work ..."

Tuesday, July 8, 2003

Good tea leaf reading :)

I finally got it mostly working. It's really unbelievable how much work it took, and it's only 99% done as we speak.

Here's the story. I had the new source preservation feature working nicely, using the DHTML editor (from IE) in "source preserving" mode. In CD 1.0, editing templates and HTML files is done in source preserving mode while editing articles is done in non-source preserving mode.

Unfortunately, we then discovered that the DHTML editor had a bug when working in "source preserving" mode: if you wrote

This is a sentence.

then selected the word "sentence" and made it bold, you would get the following HTML:

This is a <strong>sentence</strong> .

Looks good, right? Look closer. There's an unwanted space before the dot.

This wee tiny bug was completely non-work-aroundable. It's a bug in IE and we had no way to fix it.

Instead, I wrote a complete HTML source preservation system, basically from scratch, so that your indenting and line breaks are preserved in the WYSIWYG editor. Before we shove your HTML into IE for editing, we find all the whitespace and replace it with a custom tag that looks like this:

<cd:preserve whitespace="xxx">

xxx is an encoded version of the whitespace that used to be there, for example CLCL means two crlf's and a tab. This custom tag is completely invisible in the editor but IE does preserve it for us. When the HTML comes back from IE we search for those things and replace them with the original whitespace.
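The round trip described above can be sketched roughly as follows. This is a minimal illustration, not CityDesk's actual code: the function names are made up, the T (tab) and S (space) codes are assumptions (the post only shows C and L), and leaving single spaces alone and skipping whitespace inside tags are choices made for this sketch.

```python
import re

# Hypothetical one-letter codes. The post only shows C and L in its example;
# T (tab) and S (space) are assumptions made for this sketch.
ENCODE = {"\r": "C", "\n": "L", "\t": "T", " ": "S"}
DECODE = {code: char for char, code in ENCODE.items()}

TAG_SPLIT = re.compile(r"(<[^>]*>)")
WHITESPACE_RUN = re.compile(r"[\r\n\t ]+")
PRESERVE_TAG = re.compile(r'<cd:preserve whitespace="([CLTS]+)">')


def _encode_run(match):
    run = match.group(0)
    if run == " ":
        # An ordinary single space between words is left untouched.
        return run
    codes = "".join(ENCODE[ch] for ch in run)
    return '<cd:preserve whitespace="%s">' % codes


def protect_whitespace(html):
    """Replace whitespace runs between tags with placeholder tags."""
    pieces = TAG_SPLIT.split(html)
    # Even indices are the text between tags; odd indices are the tags.
    for i in range(0, len(pieces), 2):
        pieces[i] = WHITESPACE_RUN.sub(_encode_run, pieces[i])
    return "".join(pieces)


def restore_whitespace(html):
    """Turn each placeholder tag back into the whitespace it encodes."""
    return PRESERVE_TAG.sub(
        lambda m: "".join(DECODE[c] for c in m.group(1)), html
    )
```

Calling restore_whitespace(protect_whitespace(source)) should give back the original source, provided the editor in between leaves the placeholder tags alone.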

Then I incorporated the source code from TidyLib (a.k.a. HTML Tidy) to clean up the code that comes back from IE in non-source-preserving mode and make it a little less offensive:

* all tags will be closed
* everything will be cleaned up into valid xhtml
* attributes will be quoted
* unclosed tags are done the xhtml way: <br> becomes <br />
* everything is lowercased
* etc. All the xhtml stuff, now built in.

Effectively this gave us an xhtml compliant wysiwyg editor, but there was more work to do.
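To give a feel for the kind of cleanup listed above, here is a small sketch using only Python's standard html.parser module. It is not TidyLib and does not reproduce what CityDesk does: the class and function names are invented for illustration, the list of empty elements is partial, and unlike Tidy it does not add missing closing tags. It only lowercases names, quotes attributes, and writes empty elements the xhtml way.

```python
from html import escape
from html.parser import HTMLParser

# Partial list of elements that never have a closing tag (assumption:
# enough for a sketch, not the complete set).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class XhtmlWriter(HTMLParser):
    """Re-serialize HTML with lowercase names and quoted attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        rendered = "".join(
            ' %s="%s"' % (name, escape(name if value is None else value, quote=True))
            for name, value in attrs
        )
        closer = " />" if tag in VOID else ">"
        self.out.append("<%s%s%s" % (tag, rendered, closer))

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)


def to_xhtml(html):
    writer = XhtmlWriter()
    writer.feed(html)
    writer.close()
    return "".join(writer.out)
```

For example, to_xhtml('<P ALIGN=center>Hello<BR>world</P>') produces a lowercase p element with a quoted align attribute and a self-closed br.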

The next bug I wanted to work around was the fact that IE tends to sprinkle extra &nbsp;'s in your code. Why? Because if you type foo, space, space, bar, IE wants to preserve both spaces, so it has to convert the first one to an &nbsp;. Then if you use the cursor to delete the SECOND space you are left with foo&nbsp;bar. IE doesn't think this is wrong. I do. So now there is an incredible amount of logic in CityDesk to convert these &nbsp;'s back to regular spaces. It's complicated, because any *real* &nbsp;'s -- either the ones you put there yourself, or ones that are necessary to get several blank spaces in a row -- must not be tampered with. Hard.
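A much simplified version of that idea might look like the sketch below. The heuristic is an assumption made for illustration, not CityDesk's logic: it only converts an &nbsp; that sits directly between two non-space characters, and leaves alone any &nbsp; next to whitespace, next to another &nbsp;, or next to a tag. It cannot tell an accidental foo&nbsp;bar from one you typed on purpose; telling those apart would need knowledge of the source before editing, which is presumably part of why the real thing is hard.

```python
import re

# An &nbsp; with a non-space character directly on each side, not adjacent
# to another &nbsp; and not adjacent to a tag. Heuristic for this sketch only.
LONE_NBSP = re.compile(
    r"(?<=\S)(?<!&nbsp;)(?<!>)&nbsp;(?=\S)(?!&nbsp;)(?!<)"
)


def soften_lone_nbsp(html):
    """Replace isolated &nbsp; entities between words with plain spaces."""
    return LONE_NBSP.sub(" ", html)
```

With this, soften_lone_nbsp("foo&nbsp;bar") gives "foo bar", while "foo&nbsp; bar" and "<p>&nbsp;</p>" come back unchanged.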

While I was at it, there was another problem. Look at this:

{$ forEach ... $}
<li>{$ x.headline $}</li>
{$ next $}

The forEach and next statements are located in a place where it's illegal to have any text. The only thing legal in a <ul> is a <li>. So IE was moving them around, either losing them or putting them before the <ul> or at the end of the page or some such travesty.

To fix that problem we had to figure out all the places you might put CityScript that the IE HTML editor wouldn't be happy about, and protect it for you. Places like:
* in a <ul> outside a <li>
* in a <table> outside a <tr>
* in a <tr> outside a <td>
and a few others. In all those cases we now "protect" any CityScript we find in those places by wrapping it in a temporary tag which we strip out on the way back. It's absurdly complicated but the good news is It Just Works.
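The wrap-and-strip idea can be sketched like this. The tag name cd:protect and both function names are hypothetical, and for brevity the sketch wraps every CityScript block rather than only the ones sitting in places the editor dislikes, which is the harder part described above. Whether IE actually leaves such a wrapper in place is not something this sketch can check.

```python
import re

# Matches a CityScript block such as {$ next $}.
CITYSCRIPT = re.compile(r"\{\$.*?\$\}", re.DOTALL)

# Hypothetical temporary wrapper tag; the post does not name the real one.
OPEN_TAG = "<cd:protect>"
CLOSE_TAG = "</cd:protect>"
WRAPPER = re.compile(r"</?cd:protect>")


def protect_cityscript(html):
    """Wrap each CityScript block in a temporary tag before editing."""
    return CITYSCRIPT.sub(lambda m: OPEN_TAG + m.group(0) + CLOSE_TAG, html)


def unprotect_cityscript(html):
    """Strip the temporary wrapper tags on the way back from the editor."""
    return WRAPPER.sub("", html)
```

Run on the forEach example above, protect_cityscript wraps all three script blocks, and unprotect_cityscript returns the original text.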

All this, because I don't think it's OK for CityDesk to be inserting extra spaces in front of your .'s where you didn't want them...!

Joel Spolsky
Friday, July 11, 2003

Joel - if people don't tell you this enough, YOU RAWK. Kudos to everyone @ Fog Creek for putting out an excellent product and continuing to provide top notch support & development.
Friday, July 11, 2003

Thanks Joel!

You say you incorporated the source code from TidyLib - HTML Tidy. Does this mean the garbage html you get when pasting from MS Word will be cleaned up too?

Paul Iliano
Saturday, July 12, 2003

Hip, Hip, Hooray!!!

David Burch
Saturday, July 12, 2003

The improved editor sounds great. Joel said: "unclosed tags are done the xhtml way: <br> becomes <br />"

Will that validate as HTML 4? If we had been creating valid HTML 4 transitional with CityDesk v1, are we looking at switching to XHTML 1 transitional if/when we upgrade to CD v2 (if we want to keep validating)?

Pete Riis
Sunday, July 13, 2003

<br /> won't validate as HTML 4.

I imagine you'll have to change the html in your templates to become xhtml compliant but that shouldn't be a big job. My only concern is all the articles that CityDesk 1.x has already created. The thought of opening every single one and saving them to trigger the html tidying could be quite a repetitive task...

Still, it's all good fun!

John C
Monday, July 14, 2003

It would be great if we had the option of staying with 4.01 transitional. I seem to remember the problem with XHTML was with unclosed <p> tags as a result of CityScript loops, which are OK in HTML 4 but not in XHTML.

Pete Riis
Monday, July 14, 2003

It would be great if CD read the doctype and used html if the doctype didn't say xhtml.

Joel Goldstick
Monday, July 14, 2003

In general, our approach is going to be xhtml only. We don't have the resources to produce both HTML 4.0 and xhtml 1.0 valid code, and the more popular choice for people who like to produce validating web sites has long been xhtml.

Joel Spolsky
Monday, July 14, 2003

I can see that.  Now where did I put that doctype list?.......

oh, here it is:

Joel Goldstick
Tuesday, July 15, 2003

Oh, my goodness. I can really feel the pain, since I've had my own quota of pain working with DHTMLEdit and now MSHTML.

Leonardo Herrera
Wednesday, July 16, 2003

Are you also converting & in a link to &amp; ?

Phillip Harrington
Saturday, July 19, 2003

Speaking of the DHTML Control and MSHTML, when the hell is Microsoft going to get around to rewriting (or replacing) this component -- the guts of IE -- so it doesn't produce such god-awful code? At this point in the evolution of IE and W3C standards I'd say it's about time...

[PS: Would this problem fall within the scope of the System.CodeDom and System.CodeDom.Compiler namespaces of the .NET Framework? Perhaps that's where they're headed -- a fully managed (X)HTML browsing and editing implementation in Longhorn...]

Chris Weed
Wednesday, November 26, 2003

FYI -- an XHTML 1.1 compliant ActiveX editing component:

There is a freeware 'Lite' version and a 'Pro' version.

Chris Weed
Sunday, November 30, 2003
