Fog Creek Software
Discussion Board




Workaround to Bayesian filters?

Apologies if this is old news.

Ran across this interesting piece[1] via Boing Boing[2]:  apparently some spammers have started pasting chunks of text from public-domain books into their messages, in an apparent attempt to improve their overall "not-spam" average.

[1] http://news.bbc.co.uk/1/hi/technology/3247200.stm
[2] http://www.boingboing.net

Sam Livingston-Gray
Tuesday, December 2, 2003

It is old news, and it doesn't work unless you receive a lot of legitimate messages with chunks of public domain books in them.

I just went through my company's filtered spam (I try to double-check it once a week or so) and noticed quite a few lines I recognized (The Jungle Book, mostly).

At least they're trying to be inventive.  Maybe if they put the entire book in the message with their spam interspersed between the chapters more would get through.

Steve Barbour
Tuesday, December 2, 2003

To me, this raises a very interesting question.  Presumably, anybody who goes to the effort to use a spam filter does so because they don't want to read spam, which means that they find it to be 100% worthless and never respond to any of it.

So why does the spammer try to modify his mails to get around the spam filter then?

It seems to me that it must be because there are people who use spam filters who are still responding to the spam that gets around the filters.

Foolish Jordan
Tuesday, December 2, 2003

A lot of email accounts come with at least some level of spam protection for free nowadays, so I guess the spammers are just trying to get around that.

r1ch
Tuesday, December 2, 2003

>> So why does the spammer try to modify his mails to get around the spam filter then?

Because the spammer (usually) isn't paid based on responses; he's paid based on the amount of spam he successfully gets into people's mailboxes.

The 'high end' spammers are people who sell their spam service to people who want their 'marketing email' seen by a lot of people.  The spammer doesn't care if you "just hit delete" - he's made his money.

RocketJeff
Tuesday, December 2, 2003

"...  it doesn't work unless you receive a lot of legitimate messages with chunks of public domain books in them."

I hate to be argumentative, but that's simply not the case.  My Bayesian filter was working great for the last several months, until a week or two ago when the spammers started doing this trick.  Now I get three or four spam mixed in with my real email.

Grumpy Old-Timer
Tuesday, December 2, 2003

Well, it hasn't affected ours, and I didn't figure ours were any better than anybody else's.

I'd be curious as to what kind of legitimate emails you get, and whether further training will improve the situation.

We get about 30,000 spam messages per day, so we have a rather large base of spam to train from.

Steve Barbour
Tuesday, December 2, 2003

My filter (mozilla-mail's built-in one) is handling this kind of thing just fine.  I've trained it on the ~50 spams I receive a day.

It won't take long until you start filtering these out.  Remember, Bayesian filtering is adaptive: when the traits of spam change, the filter needs some time to adapt.  If your spam corpus is too small, then V1a.gr4 probably doesn't register, but once it does, all the Jungle Book in the world isn't going to stop the filter from counting the mail as spam.

Michael Koziarski
Tuesday, December 2, 2003
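
A rough sketch of the adaptation Michael describes, for anyone who wants to poke at it.  This is not mozilla-mail's actual code; it is a toy filter in the style of Paul Graham's "A Plan for Spam", and every name in it (TinyBayesFilter, top_n, and so on) is made up for illustration.  It assumes never-seen tokens count as neutral (0.5) and that only the handful of tokens furthest from neutral get combined, which is why unfamiliar padding drops out of the score.

```python
from collections import Counter
from math import prod


class TinyBayesFilter:
    """Toy Graham-style filter; all names here are hypothetical."""

    def __init__(self, top_n=15):
        self.spam_counts = Counter()  # token -> number of spam messages containing it
        self.ham_counts = Counter()   # token -> number of good messages containing it
        self.n_spam = 0
        self.n_ham = 0
        self.top_n = top_n            # only the most telling tokens are combined

    def train(self, tokens, is_spam):
        """Record one hand-classified message."""
        if is_spam:
            self.spam_counts.update(set(tokens))
            self.n_spam += 1
        else:
            self.ham_counts.update(set(tokens))
            self.n_ham += 1

    def token_prob(self, token):
        """Estimate how spammy a single token is, clamped to [0.01, 0.99]."""
        s = self.spam_counts[token]
        h = self.ham_counts[token]
        if s + h == 0:
            return 0.5  # assumption: never-seen tokens count as neutral
        spam_freq = s / self.n_spam if self.n_spam else 0.0
        ham_freq = h / self.n_ham if self.n_ham else 0.0
        p = spam_freq / (spam_freq + ham_freq)
        return min(0.99, max(0.01, p))

    def score(self, tokens):
        """Combine the top_n tokens furthest from neutral into one spam score."""
        probs = sorted(
            (self.token_prob(t) for t in set(tokens)),
            key=lambda p: abs(p - 0.5),
            reverse=True,
        )[: self.top_n]
        spam_like = prod(probs)
        ham_like = prod(1.0 - p for p in probs)
        return spam_like / (spam_like + ham_like)
```

Before the filter has seen "v1a.gr4", a padded message made only of unknown words scores a neutral 0.5 and slips through.  After a few training calls on spam containing that token, the same message scores close to 0.99, because the unknown book words sit at 0.5 and cancel out.  Padding made of words your real correspondents use would pull the score down in this sketch, so it does not settle the argument above either way.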

The problem isn't that spam gets into your inbox, because you add that to the filter; it's that perfectly good emails get wrongly classified as spam.

Simon Lucy
Tuesday, December 2, 2003

I've seen this happening.  I have to admit the resulting spam looks really weird with random sentences in it.

christopher baus (www.baus.net)
Tuesday, December 2, 2003

Bayesian spam filters are wonderful.  Even if a few spam messages slip through, the resulting message is so garbled that it's unlikely to make an effective marketing pitch.  Lower-quality marketing means fewer buyers.  Fewer buyers means less profit.

josReader
Tuesday, December 2, 2003

The filters eventually eliminate b;u;y; vi!ag.r.a and the like since they have a lot of punctuation and single chars in the subject.

Contaminating your 'good words' list was predicted, but it will eventually be handled: the good words the spammers use and those your friends use will differ.

And they still need to get that URL in there somewhere.

Eventually spam will just come down to "Hi, look at www.ilovespam.com".

Speaking of Hi, what's with the emails with just 'Hi' in the subject, with no body?  (i.e., no web bugs).  Someone collecting bounce messages to confirm addresses?  I've not seen it mentioned, and been too lazy to find out...

AJS
Wednesday, December 3, 2003
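
To make AJS's first point concrete, here is a small sketch of the kind of signal a filter could pick up from an obfuscated subject line.  It is only an illustration: the function name and the two ratios are invented here, and real filters generally learn such patterns from their tokens rather than from hand-written features like these.

```python
import string


def obfuscation_features(subject):
    """Return two rough obfuscation signals for a subject line (hypothetical helper)."""
    chars = [c for c in subject if not c.isspace()]
    punct = sum(c in string.punctuation for c in chars)

    # Split on punctuation as well as whitespace, the way a simple tokenizer might,
    # so "b;u;y;" falls apart into the single-character pieces "b", "u", "y".
    cleaned = "".join(" " if c in string.punctuation else c for c in subject)
    pieces = cleaned.split()
    single = sum(len(p) == 1 for p in pieces)

    return {
        "punct_ratio": punct / len(chars) if chars else 0.0,
        "single_char_ratio": single / len(pieces) if pieces else 0.0,
    }
```

A subject like "b;u;y; vi!ag.r.a" comes out high on both ratios, while something like "Lunch on Friday?" stays low on both.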

Simon, you say you're getting too many good emails marked as spam. I suffered from this for a while, until I changed the cutoff point (I'm using SpamBayes) from the default 90% to 98%. Now none of my good emails are marked as spam, though I get to see a little more real spam in my inbox.

Breandán Dalton
Monday, December 8, 2003
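
The tradeoff Breandán describes can be sketched in a few lines.  This is not SpamBayes's own API or configuration; classify() and the scores in the usage note are hypothetical, and the sketch only has two outcomes where a real filter may also have an "unsure" bucket.

```python
def classify(spam_score, spam_cutoff=0.90):
    """Label a message from its spam score (0.0 = surely good, 1.0 = surely spam).

    spam_cutoff is a hypothetical parameter: scores at or above it are
    treated as spam, everything else goes to the inbox.
    """
    return "spam" if spam_score >= spam_cutoff else "inbox"
```

With the cutoff at 0.90, a good email that happens to score 0.93 is marked as spam; at 0.98 it lands in the inbox.  The price is that a real spam scoring 0.95 now lands in the inbox too, which matches the "little more real spam" above.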
