Fog Creek Software
Discussion Board




Tidbits - Bayesian Anti-Spam

I've noticed an increasing tendency for spammers to include random meaningless strings in their headers & email text.  I had no idea why - until I saw this comment by Joel.  Presumably, the random text strings mess with the Bayesian analysis.  Does that sound right?

Thomas Jones
Thursday, November 27, 2003

Yes, it's an attempt to subvert both Bayesian filtering and the checksum filtering that some ISPs use.

I use a Bayesian filter and the random text doesn't seem to help spam get past the filter, as there are already too many incriminating words in spam.
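
To illustrate why random strings hurt checksum filtering in particular, here is a minimal sketch. The normalisation step and the function names (body_checksum, add_random_padding) are made up for the example; real ISP filters may use fuzzier hashes than a plain SHA-256.

```python
import hashlib
import random
import string


def body_checksum(body: str) -> str:
    """Hash the lowercased, whitespace-normalised body (a hypothetical scheme)."""
    normalised = " ".join(body.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def add_random_padding(body: str, rng: random.Random, length: int = 12) -> str:
    """Append a random lowercase string, the way the spam described above does."""
    padding = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return body + "\n" + padding


pitch = "Buy cheap pills now! Click here."
rng = random.Random(0)
copy_a = add_random_padding(pitch, rng)
copy_b = add_random_padding(pitch, rng)

# Two unpadded copies hash identically, so one blocklist entry catches both.
print(body_checksum(pitch) == body_checksum(pitch))
# Each padded copy hashes differently, so an exact-match blocklist misses it.
print(body_checksum(copy_a) == body_checksum(copy_b))
```

An exact checksum only catches byte-identical bulk mail, which is why a few random characters per message are enough to slip past it.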

Matthew Lock
Thursday, November 27, 2003

There was another thread about this below.  Basically some people said it killed their filters and others said it had no impact.  Obviously, there are many ways to implement Bayesian filtering.

Like Joel, I use SpamBayes, and the random sentences really haven't had much impact on its accuracy.

I think it'll come down to how much other text is in the email, how many emails you've trained, and how your filter deals with words it hasn't seen before.  Most likely, the "random" words won't match up too well with your good emails, so most of the words should be scored as neutral and have little impact on the score of the email.
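
As a rough sketch of how a never-seen word can end up neutral, here is the per-token estimate loosely following Paul Graham's "A Plan for Spam". SpamBayes combines evidence differently, so treat this as an illustration only; the function name, the counts and the sample tokens are all invented for the example.

```python
from collections import Counter

UNKNOWN_PROB = 0.4  # near-neutral score for tokens with too little history


def token_spam_prob(token, ham_counts, spam_counts, n_ham, n_spam):
    """Estimate how spammy one token is from training counts."""
    g = 2 * ham_counts[token]  # ham occurrences are doubled to avoid false positives
    b = spam_counts[token]
    if g + b < 5:
        # Random strings land here: not enough evidence either way.
        return UNKNOWN_PROB
    ham_freq = min(1.0, g / n_ham)
    spam_freq = min(1.0, b / n_spam)
    p = spam_freq / (ham_freq + spam_freq)
    return max(0.01, min(0.99, p))  # never fully certain in either direction


# Invented training data: 100 ham and 100 spam messages.
ham_counts = Counter({"meeting": 30, "patch": 12})
spam_counts = Counter({"viagra": 40, "click": 25, "meeting": 1})

for word in ("viagra", "meeting", "xqzvtk"):
    print(word, token_spam_prob(word, ham_counts, spam_counts, 100, 100))
```

With these made-up counts, "viagra" is clamped at 0.99, "meeting" comes out close to 0.02, and the random string "xqzvtk" gets the near-neutral 0.4.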

David
Thursday, November 27, 2003

I just had another spam message today in HTML.  It had large blocks of "plain English" (non-spam) text, but in white characters so they weren't visible.  Unless the filters are smart enough to distinguish visible text from invisible text, it seems like a devious method to circumvent filters.

Robert Jacobson
Thursday, November 27, 2003

Most Bayesian filters tokenise and score everything, markup included - so invisible text ends up with a really high spam score.
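
A minimal sketch of what "tokenise everything" could look like, using Python's standard html.parser. The tag.attribute=value token format and the SpamTokenizer class are assumptions for the example, not how any particular filter actually does it.

```python
import re
from html.parser import HTMLParser


class SpamTokenizer(HTMLParser):
    """Collect both visible words and markup attributes as tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attribute values become tokens too, so white-on-white text
        # leaves a trace such as "font.color=#ffffff".
        for name, value in attrs:
            if value is not None:
                self.tokens.append(f"{tag}.{name}={value.lower()}")

    def handle_data(self, data):
        # Text is tokenised whether or not a mail client would display it.
        self.tokens.extend(re.findall(r"[a-z0-9$'-]+", data.lower()))


def tokenize(html_text):
    parser = SpamTokenizer()
    parser.feed(html_text)
    parser.close()
    return parser.tokens


message = '<font color="#FFFFFF">the quarterly meeting notes</font><p>Buy now!</p>'
print(tokenize(message))
# ['font.color=#ffffff', 'the', 'quarterly', 'meeting', 'notes', 'buy', 'now']
```

If white-font tokens show up mostly in spam during training, the hiding trick itself becomes evidence against the message.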

Matthew Lock
Thursday, November 27, 2003

Well it only ends up with a high score if statistically it conforms to whatever you've taught the filter.  If the counter-weight text matches what you've taught it is allowable  then it will tend to make the whole email allowable.

Simon Lucy
Thursday, November 27, 2003

> If the counter-weight text matches what you've taught it is
> allowable  then it will tend to make the whole email
> allowable.

It will only make the email allowable if it contains no tokens which were classified as spam.

The spammer could include the entire text of a neutral book, or exactly the kind of text I am interested in, but a few mentions of viagra or get-rich-quick words and it would be classified as spam.
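
A sketch of why neutral padding doesn't rescue the pitch, again assuming Graham-style combining (only the 15 tokens furthest from 0.5 are used, and anything over 0.9 counts as spam). The per-token probabilities below are invented, and other filters combine scores differently.

```python
import math


def combined_spam_score(probs, max_interesting=15):
    """Combine per-token spam probabilities using only the most decisive ones."""
    interesting = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)
    interesting = interesting[:max_interesting]
    spam_product = math.prod(interesting)
    ham_product = math.prod(1.0 - p for p in interesting)
    return spam_product / (spam_product + ham_product)


# Invented example: four strongly spammy pitch tokens buried in 200
# near-neutral padding tokens (0.4 each).
pitch_tokens = [0.99, 0.99, 0.98, 0.97]
padding_tokens = [0.4] * 200

score = combined_spam_score(pitch_tokens + padding_tokens)
print(score > 0.9)  # True: the few pitch tokens dominate the padding
```

Padding only helps the spammer if it scores as strongly ham-like for that particular recipient, which is the case argued over above.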

Matthew Lock
Thursday, November 27, 2003

Wouldn't common colours like FFFFFF get a really high spam score?

The only reason to use white would be if you wanted to hide something.

Perhaps we need smarter filters which weigh random/neutral tokens less, and ham/spam tokens more, which should make random strings less effective.

Annoyed that there is no anonymous posting.
Thursday, November 27, 2003

Read Paul Graham's articles.  He answers most of this stuff.
http://www.paulgraham.com/antispam.html

SomeBody
Thursday, November 27, 2003

What "Somebody" said.  Read Paul Graham's articles.  From what I can tell SpamBayes uses most of his ideas.

Basically, the only way the neutral text can work is if it looks like the legitimate emails you get.  How likely is that?  Not very.  There really is a fairly small range of words that are common in your Ham emails.  A spammer just isn't that likely to hit them often enough to offset the other content in his email.  The spam is going to have to have some text that tries to sell something or get you to click on a URL.  There are only so many ways you can make that pitch and have it not match text that's already labeled as spam.

And if the random hunk of text includes words you haven't seen much in your ham, they won't affect the final score much.

David
Thursday, November 27, 2003

SpamBayes may be filtering all of Joel's email, but I doubt it's really stopping spam on his side. The real problem is that you still have to download all the mail to your local inbox to get it sorted, which wastes time and some bandwidth. What should happen is that email is sorted and processed on the server.
The best solution I've found is called Mailwasher (www.mailwasher.net). It checks your email on the server and you can select which is spam and which is not; it lets you mark messages for delete, bounce and blacklist. If you bounce the message the sender will know that the email is not valid anymore and eventually remove your email from spam lists, which is good. By marking messages as blacklist, over time the system will learn to classify them, and all you have to do is download the mail that matters.

pvf
Friday, November 28, 2003

Do these spam filters work OK with Antivirus software installed?

Interaction Architect
Friday, November 28, 2003

"If you bounce  the message the sender will know that the email is not valid anymore and eventually remove your email ..."

<quibble>
Very few spammers care about bounced messages.  Most use forged or disposable email addresses, after all, so they won't even receive the bounced messages.  From the spammer's perspective, it's more efficient just to keep sending messages to everyone, including invalid addresses -- removing addresses costs time (and therefore money.)
</quibble>

Robert Jacobson
Friday, November 28, 2003

I wonder how long we're going to go without really commercial, well funded operations getting into spamming. Operations that can afford to hire techies to defeat spam filters.

It might seem gloomy, but most spammers seem to be one-man operations that just shoot out emails. When the spam bubble bursts, and corporations stop paying for spam to be sent, you'll start seeing smarter spammers. Then Bayesian filters will get a run for their money. Right now, the Bayesian filters seem to be winning (I still use POPFile, which is pretty good too).

deja vu
Saturday, November 29, 2003

---"really commercial, well funded operations getting into spamming"---

And get themselves sued to high heaven? Or watch their stock price plummet because of lack of goodwill?

Beating Bayesian filters will only be worth the effort if they become a factor. They aren't at the moment because the guy who puts a Bayesian filter on has long since stopped being a worthwhile target for a spammer.

Stephen Jones
Sunday, November 30, 2003

What world is this where we care more about the mail we don't want than the mail we want... oh, I forgot, it's the Internet...

pvf
Sunday, November 30, 2003

Mailshell Inc.'s antispam is the most difficult to set up, and getting technical support is practically asking for hell!!!

I downloaded it and paid 29.95, and they are not responsive to a customer's requests for help.

You may say I am sour grapes - yes... after paying 29.95 for nothing, I got just one email from them, and it didn't tell me much :( so I've got to resort to sour grapes :(

Charles Chan QH
Thursday, July 29, 2004
