Fog Creek Software
Discussion Board




Combating SPAM

So how would a Bayesian filter stop this (taken from an email's HTML source):

we have 1000 exclusive ha<!DI>rd<!EEPTI>core phot<!AX>os with lit<!ADNE>tle, ta<!IZABO>sty ch<!JE>ildr<!VIZY>en and over 300
Megabytes of high qu<!WYEDTI>al<!ATEOL>ity ha<!SDOR>rdc<!ASQDI>ore C<!MPURZEE>P vide<!NEV>os

Is there anything out there that could work with this?

Jack of all
Friday, July 30, 2004

argh, ignore markup?

i like i
Friday, July 30, 2004

I guess in principle comparing the text components to a dictionary would work. If it doesn't achieve a high enough "score" (i.e. if too much of the content was garbage) then junk it. It might also have the beneficial side-effect of making people take a bit more care over their spelling when sending email :-) Hmm, maybe it should check grammar as well...


Friday, July 30, 2004
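
A minimal sketch of the dictionary-score idea from the post above, in Python. The function names, the word list, and the 0.6 threshold are all assumptions made for illustration; a real filter would load a proper word list and tune the cutoff on actual mail.

```python
import re

# Crude tokenizer: runs of letters and apostrophes (an assumption, not a standard).
WORD_RE = re.compile(r"[a-z']+")

def dictionary_score(text, dictionary):
    """Return the fraction of words in `text` that appear in `dictionary`."""
    words = WORD_RE.findall(text.lower())
    if not words:
        return 0.0
    known = sum(1 for word in words if word in dictionary)
    return known / len(words)

def looks_like_junk(text, dictionary, threshold=0.6):
    """Flag the text when too little of it is recognisable (hypothetical cutoff)."""
    return dictionary_score(text, dictionary) < threshold

# Tiny stand-in dictionary, for illustration only.
words = {"cheap", "watches", "for", "sale"}
print(dictionary_score("cheap watches for sale", words))
print(looks_like_junk("ch eap wat ches f or sa le", words))
```

Note that this only helps if the score is taken on the text as it arrives broken up; once the markup is stripped, the words rejoin and look perfectly ordinary to a dictionary.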

A Bayesian filter would index the end result that a human sees, i.e. the text left over once the markup is ignored.

From http://spambayes.sourceforge.net/background.html : "In the end, the best results were found by stripping out most HTML clues."

From http://spamprobe.sourceforge.net/ : "Ignores HTML tags in emails for scoring purposes unless the -h command line option is used. Many spams use HTML and few humans do so HTML tends to become a powerful recognizer of spams. However in the author's opinion this also substantially increases the likelihood of false positives if someone does send a non-spam email containing HTML tags."

Nate Silva
Friday, July 30, 2004
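
To make "index what a human sees" concrete, here is a rough Python sketch of stripping the markup before tokenizing. It is not how SpamBayes or SpamProbe do it; `visible_text` and both regexes are assumptions for illustration, and the sample string is a harmless stand-in for the spam quoted at the top of the thread.

```python
import re
from html import unescape

# Bogus declarations such as <!QX> are dropped without leaving a gap,
# so the word fragments on either side rejoin.
DECL_RE = re.compile(r"<![^>]*>")
# Ordinary tags become a space so adjacent words stay separate.
TAG_RE = re.compile(r"<[^>]*>")

def visible_text(html_source):
    """Approximate the text a reader would see in a mail client."""
    text = DECL_RE.sub("", html_source)
    text = TAG_RE.sub(" ", text)
    text = unescape(text)          # decode entities such as &amp;
    return " ".join(text.split())  # collapse runs of whitespace

sample = "che<!QX>ap <b>wat<!ZZ>ches</b> &amp; more"
print(visible_text(sample))
```

A regex is a blunt tool here: a spammer who splits words with empty ordinary tags such as `<b></b>` would defeat this version, so a real filter would more likely lean on a proper HTML parser.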

The short answer is "Practice". 

The longer answer is that the existence of a trick like that becomes a clue for identifying spam. So spammers will be forced to change tricks as they go along, and it will get harder for them.

On a personal level, yesterday 44 spam e-mails went to my junk folder. Of those, 5 initially went into the suspect folder, plus one which was OK. None got through. When there's a change in tactics maybe 1 or 2 get through and I get ~10 in the suspect pile. Given the alternative is to sort through them myself, I can live with that hit rate.

a cynic writes...
Friday, July 30, 2004

The dictionary thing could also work against this:

http://www.marktaw.com/temp/sexvideos.txt

One of the more clever emails I've received. Each chunk of characters in "CLICK HERE" is actually a *different* link, I think to Geocities pages. Probably redirects.

www.MarkTAW.com
Friday, July 30, 2004

Geez, you guys post fast.

www.MarkTAW.com
Friday, July 30, 2004



Quite a few people are working on simple Perl scripts that strip out the tags, then run the regular Bayesian filtering, and then a spellchecker.

KC
Friday, July 30, 2004

How many hits did you get on your sex videos line, MarkTAW? <grin>


Friday, July 30, 2004

Hey, wow, you got the TAW as all caps. Most people think it's a word.

www.MarkTAW.com
Friday, July 30, 2004

> So how would a Bayesian filter stop this (taken from an
> email's HTML source):

A Bayesian filter would stop this because it doesn't have any ham words in it; no ham words means it won't end up in your inbox. For an email to get into your inbox it would have to have ham words *and* minimal spam words.

This will help you understand: http://www.paulgraham.com/better.html

By the way, I think the fact that spammers are now doing this proves that Bayesian filters are causing them enough problems that they have to try to get round them.

Matthew Lock
Saturday, July 31, 2004
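
A simplified Python sketch of the ham-word/spam-word scoring described in the post above, loosely following the scheme in Paul Graham's essays. The counts, the 0.4 default for unseen tokens, the 0.01/0.99 clamps, and the function names are illustrative assumptions, not a faithful copy of any particular filter.

```python
from math import prod

def token_spam_prob(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate how spammy a single token is, from training counts."""
    b = spam_counts.get(token, 0)
    g = ham_counts.get(token, 0)
    if b + g == 0:
        return 0.4  # never seen before: treated as mildly innocent (assumed default)
    spam_freq = min(1.0, b / n_spam)
    ham_freq = min(1.0, g / n_ham)
    p = spam_freq / (spam_freq + ham_freq)
    return max(0.01, min(0.99, p))  # clamp so no single token is absolute

def message_spam_prob(tokens, spam_counts, ham_counts, n_spam, n_ham, top_n=15):
    """Combine the most telling tokens into one spam probability."""
    probs = [token_spam_prob(t, spam_counts, ham_counts, n_spam, n_ham)
             for t in set(tokens)]
    # Keep only the tokens furthest from neutral (0.5).
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    probs = probs[:top_n]
    spam = prod(probs)
    ham = prod(1 - p for p in probs)
    return spam / (spam + ham)

# Made-up training counts, for illustration only.
spam_counts = {"mortgage": 20}
ham_counts = {"meeting": 20}
print(message_spam_prob(["mortgage", "rates"], spam_counts, ham_counts, 20, 20))
print(message_spam_prob(["meeting", "notes"], spam_counts, ham_counts, 20, 20))
```

In this sketch a message needs tokens with a strong ham history to score low, which is the point made above: text with no ham words and a few known spam words has nothing pulling it toward the inbox.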
