Fog Creek Software
Discussion Board




tidying up arbitrary html code

i get a chunk of text which might contain html tags (not a complete website just some <b>'s and <u>'s and maybe an <img> or <a>).

i want to clean up that code. that means i want to check if the tags are nested properly, if they are opened before they are closed, if they are redundant, if i can combine some tags, etc.

someone suggested in another thread to create a dom and work with that. now my question: do you know a way to do so without a dom? just some string comparisons etc.?

my language for this project is javascript although any code or pseudo code is fine.

thanks in advance for any help and please excuse my grammar. :-)

dan
Tuesday, February 24, 2004

When is this homework assignment due?

Ron
Tuesday, February 24, 2004

HTML lets you mismatch tags.  HTML with mismatched tags is perfectly valid HTML.

XHTML insists that tags are properly nested etc.  That can be validated against the DTD or whatnot.

i like i
Wednesday, February 25, 2004

Actually, on second pass I recognise a situation I've been in where what you ask has been required:

Often, message-boards etc allow html in the messages.  However, you have to ensure that html doesn't 'seep' out and mess up the hosting page.

Back when I tackled the problem I just escaped all < and > signs that weren't exactly matching patterns of html that I allowed: these being stuff like <b> and </b> and <i> etc.  I was kind enough to allow uppercase combinations, but not combinations with spare spaces etc.  At the end of the message I always appended </b></i> etc to ensure there was no overflow.  (To do the replace, first escape all < and >, then go replacing &lt;b&gt; with <b> again.)
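A rough sketch of that escape-then-restore idea in javascript, in case it helps. The names (ALLOWED_TAGS, escapeHtml, sanitizeMessage) and the exact whitelist are made up for illustration, and escaping & as well as < and > is an extra assumption so that a literal "&lt;b&gt;" typed by a user is not turned back into a tag. It is not a complete sanitizer.

```javascript
// Hypothetical whitelist; only plain tags with no attributes are restored.
const ALLOWED_TAGS = ["b", "i", "u"];

// Escape everything first so no markup from the message survives by default.
// Escaping "&" is an added assumption, not something the approach above requires.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

function sanitizeMessage(text) {
  let out = escapeHtml(text);
  for (const tag of ALLOWED_TAGS) {
    // Restore exact matches only: any letter case, but no spaces or attributes.
    const open = new RegExp("&lt;" + tag + "&gt;", "gi");
    const close = new RegExp("&lt;/" + tag + "&gt;", "gi");
    out = out.replace(open, "<" + tag + ">").replace(close, "</" + tag + ">");
  }
  // Always append closers so an unclosed tag cannot spill into the host page.
  for (const tag of ALLOWED_TAGS) {
    out += "</" + tag + ">";
  }
  return out;
}

console.log(sanitizeMessage("<B>hello</B> <script>x</script>"));
```

The appended closers are unconditional, so well-formed messages end up with stray closing tags; browsers generally ignore those, but a validator will complain.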

However, these days I might first check out those new SPAN and DIV and the scope of stuff inside TDs and what not and see if you can simply stop the html in the message seeping by containing it.  (You might want still to zap lots of html, e.g. iframe is nasty, img might be used to track users, what would happen if someone wrote </html>? etc; ok ok, I talk myself back to the first approach!)

i like i
Wednesday, February 25, 2004

Have I not replied to this post enough already?

A final point (I promise), client-side javascript would *not* be the place to put this (if the situation I outlined above is the problem you are trying to solve).

Things like this should be server-side (part of scrubbing and validating input upon submission).

All client-side validation that I've seen has failed in the field on different browsers and different security settings and javascript support etc.

i like i
Wednesday, February 25, 2004

"HTML lets you mismatch tags.  HTML with mismatched tags is perfectly valid HTML."

No it isn't, actually. Chances are that it will still display correctly because browsers are very forgiving of badly written HTML but it's not valid. Use the W3C validator at http://validator.w3.org/ to check.

John Topley (www.johntopley.com)
Wednesday, February 25, 2004

I interpreted the "dom" suggestion as meaning: work on the model part of your MVC.
When you have the selection on the model, how hard can it be in your simple case to just split out e.g. the bold operation into different sections over the tree? You can even do it by just simply going linearly through an HTML string if you want.
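For the "going linearly through an HTML string" route, here is a minimal sketch in javascript that uses a stack of open tag names instead of a DOM. Everything in it (balanceTags, VOID_TAGS, the regular expression) is an assumption for illustration, not something from this thread. It only repairs nesting: it drops closing tags that were never opened, closes inner tags before outer ones, and closes whatever is still open at the end. It does not merge redundant tags, does not re-open a tag it had to close early, and will trip over a ">" inside an attribute value.

```javascript
// Tags that never take a closing tag (assumed subset for this sketch).
const VOID_TAGS = new Set(["img", "br", "hr"]);

function balanceTags(html) {
  // Groups: optional "/", tag name, then attributes up to the next ">".
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)([^<>]*)>/g;
  const stack = []; // names of tags that are currently open
  let out = "";
  let last = 0;
  let m;
  while ((m = tagPattern.exec(html)) !== null) {
    out += html.slice(last, m.index); // copy the text before this tag
    last = tagPattern.lastIndex;
    const isClose = m[1] === "/";
    const name = m[2].toLowerCase();
    if (VOID_TAGS.has(name)) {
      if (!isClose) out += m[0]; // keep <img ...>, drop a stray </img>
    } else if (!isClose) {
      stack.push(name);
      out += "<" + name + m[3] + ">";
    } else {
      const idx = stack.lastIndexOf(name);
      if (idx === -1) continue; // closing tag that was never opened: drop it
      // Close everything opened after the matching tag, then the tag itself.
      while (stack.length > idx) {
        out += "</" + stack.pop() + ">";
      }
    }
  }
  out += html.slice(last);
  // Close whatever is still open, innermost first.
  while (stack.length > 0) {
    out += "</" + stack.pop() + ">";
  }
  return out;
}

console.log(balanceTags("<b>bold <i>both</b> italic</i>"));
// prints: <b>bold <i>both</i></b> italic
```

Note that in the example the word "italic" loses its formatting, because the sketch closes <i> early and then throws away the late </i>. Re-opening the tag after </b> would be the next refinement.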

Just me (Sir to you)
Wednesday, February 25, 2004

HTML Tidy is a utility for cleaning up HTML, with lots of nice configuration options.

Aaron Lawrence
Wednesday, February 25, 2004

http://infohound.net/tidy/

₪
Thursday, February 26, 2004
