Fog Creek Software
Discussion Board




The simplest way to filter in HTML tags

Hi

I'm trying to write a routine that scans through a piece of HTML code and decides whether or not certain HTML tags should be removed or retained.

Let's say the function is called RemoveUnwantedTags(string TagsToKeep)

and the given HTML is:
<font>The Quick Brown Fox</font><b>Jumped</b><i>Over</i> <u>the Lazy Dog</u>

If TagsToKeep has the value "<b>,<i>", the returned value would be:

The Quick Brown Fox<b>Jumped</b><i>Over</i> the Lazy Dog

Only <b> and <i> tags are kept. Other tags are removed.

My routine currently uses regular expressions to pick out all <tags></tags> and compare them against the tags defined in the TagsToKeep property. If it's a match, the tags are kept; otherwise, they are removed.

But something tells me this is not the most efficient way of solving this problem. Does anyone know of a neater and more efficient solution? Any hints are welcomed!

Thanks!

anon
Tuesday, May 11, 2004

Efficient as in quickest runtime or efficient as in quickest to program??

If your regex effort works well enough, and it's written, why change?

If runtime performance is crucial, consider finding/writing a tokeniser to do this work for you.  The hit, I guess, with the regex is that you scan the entire file for one tag, then another, then another, etc., right?  A single pass is probably quicker.  And more complex.
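A minimal sketch of the single-pass idea, in Python. It assumes the tags to keep are given as bare names ("b", "i") rather than the "<b>,<i>" string from the original question, and the names `tokenize` and `remove_unwanted_tags` are made up for illustration. The tokeniser here is deliberately naive: it does not understand comments or a ">" inside an attribute value.

```python
import re

# Naive tag pattern: optional slash, a tag name, then anything up to the next '>'.
_TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def tokenize(html):
    """Yield ('tag', name, raw) or ('text', None, raw) tuples in document order."""
    pos = 0
    for m in _TAG.finditer(html):
        if m.start() > pos:
            yield ("text", None, html[pos:m.start()])
        yield ("tag", m.group(2).lower(), m.group(0))
        pos = m.end()
    if pos < len(html):
        yield ("text", None, html[pos:])


def remove_unwanted_tags(html, tags_to_keep):
    """Walk the tokens once; keep all text and only the tags named in tags_to_keep."""
    keep = {t.lower() for t in tags_to_keep}  # set lookup replaces the inner loop
    return "".join(
        raw for kind, name, raw in tokenize(html)
        if kind == "text" or name in keep
    )
```

Each tag is looked up in a set, so there is one scan of the input and no loop over the keep-list per tag.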

i like i
Tuesday, May 11, 2004

Exactly, you've hit the nail on the head. The solution I have now requires two nested for loops.

One to iterate through all the tags found within the HTML. And another to see if the tag matches.

Ideally, it would be done in a single regular expression. One that says "find all tags that do not match the list"

or

"find all tags that match the list" and take the inverse of that.

I'm not sure if either is possible...
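For what it's worth, the first variant ("find all tags that do not match the list") can be sketched with a negative lookahead, at least in regex flavours that support one. The Python below is only an illustration: `build_strip_pattern` is a made-up name, the keep-list is assumed to hold bare tag names, and the pattern will be fooled by comments or by a ">" inside an attribute value.

```python
import re


def build_strip_pattern(tags_to_keep):
    """Compile a regex that matches every tag whose name is NOT in tags_to_keep."""
    if not tags_to_keep:
        # Nothing to keep: match every tag.
        return re.compile(r"</?[a-zA-Z][^>]*>")
    names = "|".join(re.escape(t) for t in tags_to_keep)
    # (?!...) refuses a kept name that is followed by whitespace, '/' or '>',
    # so keeping "b" does not accidentally keep "br" or "body".
    return re.compile(rf"</?(?!(?:{names})[\s/>])[a-zA-Z][^>]*>", re.IGNORECASE)


def strip_unwanted(html, tags_to_keep):
    """Delete every tag the pattern matches in a single substitution."""
    return build_strip_pattern(tags_to_keep).sub("", html)
```

With this, the whole job is one `sub` call and no explicit loops, although the regex engine still has to test each tag against the alternatives.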

anon
Tuesday, May 11, 2004

This isn't the answer to your question, but perhaps the solution is to just provide your own HTML-esque syntax to support whatever features you are after.  Then your reg-ex would be a simple replacement of all < and > with &lt; and &gt;

I guess it really depends on how many tags are in your list.  Most discussion boards I've seen that allow markup only allow it in this way.

Seeker
Tuesday, May 11, 2004

"Then your reg-ex would be a simple replacement of all < and > with &lt; and &gt;"

What about & with &amp;? And there are others, I think.

John Topley (www.johntopley.com)
Tuesday, May 11, 2004

Maybe use php:

http://us3.php.net/strip-tags

Tom H
Tuesday, May 11, 2004

I once did this in asp by first replacing all < with &lt; and all > with &gt;. Then when that was done I replaced all &lt;b&gt; with <b> and so on. Probably not the best way but simple and worked.
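The same escape-then-restore trick can be sketched in Python rather than ASP. The function name `escape_then_restore` is made up, and only the bare forms `<b>` and `</b>` are restored, so a kept tag that carries attributes stays escaped. Note that this shows the unwanted tags as literal text instead of removing them, which is a bit different from what the original question asked for.

```python
import html


def escape_then_restore(text, tags_to_keep):
    """Escape all markup, then turn the escaped forms of allowed tags back into tags."""
    # quote=False escapes only &, < and >; & is handled first by html.escape.
    escaped = html.escape(text, quote=False)
    for tag in tags_to_keep:
        for form in (f"<{tag}>", f"</{tag}>"):
            escaped = escaped.replace(html.escape(form, quote=False), form)
    return escaped
```

Because & is escaped as well, input that already contains the text `&lt;b&gt;` is not turned into a real tag by the restore step.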

ASPguy
Tuesday, May 11, 2004

Would it be any quicker? I mean, each tag found has to be compared to the list of tags to keep at some point.

Aside: doing it yourself is harder than you might think when you realise you can have white space and comments and PHP code and suchlike between '<' and '>'

Jack V.
Tuesday, May 11, 2004

If you really need it to run lightning-fast, you could try writing a smart routine in C to do it for you with the minimal number of character comparisons, let's say for educational purposes - I'm sure there's fast parser/tokeniser code out there you could use but it would probably be overkill. Sometimes it's nice to think about the algorithms though.

One first idea would be:

Have it build up a character-by-character decision tree in memory, in a suitably laid out data structure, at the start of the program from the list of allowed tags, and then use this to process any strings you throw at it. Should be pretty fast. It would reject a tag as soon as it came across a character that couldn't occur in a valid tag given the previous ones, going along character by character through its decision tree.
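A rough sketch of that decision tree, written in Python rather than C to keep it short. The tree is a nested dict (a trie) built once from the allowed tag names; `build_trie`, `filter_tags` and the `_END` marker are illustrative names, and the scanner assumes a tag ends at the first ">" it sees, so comments and quoted ">" characters are not handled.

```python
_END = object()  # marks "an allowed tag name ends here"; can never equal a character


def build_trie(allowed_tags):
    """Build the character-by-character decision tree from names such as 'b' or 'font'."""
    root = {}
    for name in allowed_tags:
        node = root
        for ch in name.lower():
            node = node.setdefault(ch, {})
        node[_END] = True
    return root


def filter_tags(html, trie):
    """Copy text through; keep a tag only if its name walks the trie to an end marker."""
    out = []
    i, n = 0, len(html)
    while i < n:
        if html[i] != "<":
            out.append(html[i])
            i += 1
            continue
        end = html.find(">", i)
        if end == -1:          # unterminated '<': pass the remainder through as-is
            out.append(html[i:])
            break
        j = i + 1
        if j < end and html[j] == "/":
            j += 1             # closing tag: skip the slash
        node = trie
        # Stop walking as soon as a character cannot belong to any allowed name.
        while j < end and node is not None and html[j] not in " \t\r\n/":
            node = node.get(html[j].lower())
            j += 1
        if node is not None and _END in node:
            out.append(html[i:end + 1])
        i = end + 1
    return "".join(out)
```

Build the trie once with `build_trie(["b", "i"])` and reuse it for every string, which is where the saving over re-reading the keep-list would come from.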

Even faster might be writing something to generate C code to implement the decision tree as a bunch of if/else if/else statements and *ptr++ == 'x' type comparisons, if the list of allowed tags is fixed. Although the cache might not like that so much, and it'd sure look ugly :)

Matt
Tuesday, May 11, 2004

Other things to bear in mind: often the key reason for limiting tags is to stop people uploading malicious scripts. Scanning for <script>...</script> is fine, but also consider that script can live in all sorts of other places:

  * <tag onclick="..." onload="..." on___="..." >
  * Behaviours (where you can associate script with elements via styles - which I think is IE-only)

And then Server.HtmlEncode() the rest (which is in classic ASP as well as .NET if you're using MS technologies).
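One cautious way to act on this, sketched in Python with the standard library's `html.parser`: re-emit allowed tags with no attributes at all (which drops on___ handlers and style-based behaviours along with everything else) and HTML-encode all remaining text. `SafeFilter` and `sanitize` are made-up names, and dropping every attribute is an assumption of this sketch, not something the posts above prescribe.

```python
import html
from html.parser import HTMLParser


class SafeFilter(HTMLParser):
    """Keeps allowed tags as bare <tag>/</tag>; escapes all text; drops everything else."""

    def __init__(self, tags_to_keep):
        super().__init__(convert_charrefs=True)
        self.keep = {t.lower() for t in tags_to_keep}
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            self.out.append(f"<{tag}>")   # attrs deliberately ignored

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(html.escape(data))  # encode &, <, > and quotes in text


def sanitize(markup, tags_to_keep):
    parser = SafeFilter(tags_to_keep)
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)
```

A real whitelist would probably allow a few harmless attributes back in; this version errs on the side of throwing them all away.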

Duncan Smart
Wednesday, May 12, 2004

"One to iterate through all the tags found within the HTML. And another to see if the tag matches."

Make your regex smarter.

(Here's hoping this formats right - preview would be a nice feature...)

"&lt;(b|i|font)&gt;"

5v3n
Wednesday, May 12, 2004

"<(b|i|font)>"

5v3n
Wednesday, May 12, 2004
