Fog Creek Software
Discussion Board




Spammers now thwarting Bayesian filters

I use two different Bayesian filters (one at home, one at work), and lately spammers have figured out how to get around them.

All they do is include a sentence or two of random text, and it causes the filters to generate a false negative.

I knew it was too good to be true...

Grumpy Old-Timer
Monday, November 24, 2003

Congress is in the process of passing legislation requiring some kinds of spam, such as adult content, to indicate this in the subject line.  They also have some stuff about not faking return addresses and so forth.  All this should make filtering much more effective.

Name withheld out of cowardice
Monday, November 24, 2003

Then you have bad filtering software, you've miscategorized some spam or you don't get enough spam to properly train the filters.

Spammers have been using the random words technique for quite a while now and it's had almost no effect on my filter (SpamBayes).  But then, like Joel, I get about 200 a day, so I have a very robust training supply.

At best, the random words spams, or even worse, the very terse spams, end up in my Spam Suspects folder.  I'd say 5-10% of my spam ends up there first.  That doesn't bother me though, because the number's still small enough that I catch the occasional legit email that ends up there.

BTW, I pretty much NEVER get false positives with SpamBayes, and that's really the key to any spam filtering.

David
Monday, November 24, 2003

He doesn't mean just single words. Imagine if spammers start to include two or three pages of text under the ad: there is a bigger chance that it will look like normal mail than not.

n/a
Monday, November 24, 2003

The Mozilla Bayesian spam filter didn't work well enough for me to keep using it. I get a lot of training material and it repeatedly let even copies of the same spam through.

son of parnas
Monday, November 24, 2003

I think I know which spam the original poster is talking about. I just received something that got through SpamAssassin and Mail's filters.

It was your usual spam message, but every other line was some famous quote. The message was probably two pages long.

Fred
Monday, November 24, 2003

"Then you have bad filtering software, you've miscategorized some spam or you don't get enough spam to properly train the filters."

Yeah, right.  You just haven't been hit yet.  I use SpamBayes at work, I haven't miscategorized spam, and I feel that several thousand emails is more than enough for training.

Besides, it all worked great for months until the last few weeks, when the spammers got wise to it and started appending innocent text.  It'll be interesting to see what the vendors' response to it is.  I can only hope it's not, "You miscategorized some spam"...

Grumpy Old-Timer
Monday, November 24, 2003

Don't you mean:

"John, Sp.ammers now thwarting Bay esean filters 4356r6t8d46"

www.MarkTAW.com
Monday, November 24, 2003

I've always wondered. We used to use Bayesian approaches for document classification. It worked OK but required a very sizeable collection of preclassified documents as a training set, and needed extensive maintenance and updating of the training set due to drift and changes in the domain.
I guess using this approach in a hostile domain will prove interesting.

Just me (Sir to you)
Monday, November 24, 2003

"Congress is in the process of passing legislation requiring some kinds of spam, such as adult content, to indicate this in the subject line."

Unfortunately the bill is horrible.  It doesn't allow recipients of spam to sue for damages, so enforcement will be almost nil.  It also preempts tougher state laws (like California's).  Finally, unless and until the FTC develops a "do not spam" list, it gives marketers a green light to send as much spam as they want.  It basically says that spam is OK, as long as it's not misleading.

Of course, it also doesn't stop shadowy spammers who cover their tracks well enough to avoid enforcement, or spammers who operate from overseas.  This will be a problem with any spam legislation, however.

http://www.cauce.org/news/index.shtml

Robert Jacobson
Monday, November 24, 2003

Who in their right mind would enter their email address on a do not spam list?  You're just asking for that email address to be spammed even harder by amoral spammers.  Spammers who don't care about laws could get a list of thousands of legitimate email addresses to spam.  The phone list is a different medium: you can't spam a list of phone numbers as easily or as untraceably.

chris
Monday, November 24, 2003

chris: exactly! Now that some spammers are releasing viruses that allow them to hack into other people's machines to send spam, what difference does it make that sending spam is against the law?

RocketJeff
Monday, November 24, 2003

Chris,

Agreed that a "do not spam" list is potentially a _huge_ target for abuse.  Unfortunately, the federal "do not call" list is so popular that this legislation started sailing through once it was amended to include a "do not spam" list provision.  (I think it passed the Senate unanimously.)  The general public isn't sophisticated enough to understand the differences.

The only real solution to cut down on spam is to make it an "opt-in" system -- making it illegal to send commercial email unless the recipient has previously given permission.  Various opt-in bills have been kicking around Congress for years, but the direct-marketing lobby keeps fighting them.

The only mitigating factor with a do-not-spam list is that the list wouldn't necessarily have to be widely distributed.  Instead, it could work by being housed on a central FTC server.  Marketers could then send their list of email addresses to the FTC and have that list cleansed of "do not spam" addresses.  However, the central FTC servers would immediately be target number one for hackers.  It sounds like a recipe for disaster.

Robert Jacobson
Monday, November 24, 2003

>>
All they do is include a sentence or two of random text, and it causes the filters to generate a false negative.
<<

I'm not sure about the spam filters in common use but my understanding from reading Paul Graham's articles is that his implementation considers only the top fifteen tokens in its calculation ('top' being defined as the furthest from neutral). 

Thus inserting random text won't throw off the filter unless the spammer has figured out a set of strong non-spam tokens to include.  These non-spam tokens are determined on an individual basis so it would be very difficult for the spammer to guess these.  For example, your first name, your zip code, or the company you work for would likely be strong indicators of non-spam since spammers typically don't (and can't) personalize email in this manner. 

It's possible that spammers could insert common non-spam tokens but it should simply take a little extra training to compensate for this since these tokens would then be promoted to neutral (or even spam if they get popular enough).  Or perhaps the filters could be improved by collecting a set of these words as used by spammers to ignore in calculations.
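For illustration, here is a minimal Python sketch of the top-fifteen-token calculation described above. It is not taken from any particular filter. The function name spam_score, the sample token probabilities, and the 0.4 score for never-seen tokens are all assumptions made for the example (0.4 is my recollection of the value in Graham's essay). The chosen probabilities are combined as prod(p) / (prod(p) + prod(1 - p)).

```python
from math import prod

UNKNOWN = 0.4   # assumed score for tokens the filter has never seen
TOP_N = 15      # number of "most interesting" tokens considered


def spam_score(tokens, token_probs, top_n=TOP_N):
    """Combine the top_n token probabilities that lie furthest from 0.5."""
    # Look up each distinct token; unseen tokens get the mild UNKNOWN score.
    probs = [token_probs.get(t, UNKNOWN) for t in set(tokens)]
    # "Interesting" means far from neutral (0.5) in either direction.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:top_n]
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)


# Hypothetical per-user probabilities, invented for this example.
token_probs = {
    "viagra": 0.99, "click": 0.95, "free": 0.90, "unsubscribe": 0.97,
    "alice": 0.01, "standup": 0.01, "acme": 0.02, "94110": 0.02,
}

ad = ["viagra", "click", "free", "unsubscribe"]
random_padding = [f"randomword{i}" for i in range(30)]   # tokens never seen before
personal_padding = ["alice", "standup", "acme", "94110"]  # strong non-spam tokens

padded_score = spam_score(ad + random_padding, token_probs)
personal_score = spam_score(ad + personal_padding, token_probs)
```

With these made-up numbers, thirty unseen random words barely move the score, while four strongly personal tokens pull it well under 0.5. That is why the spammer would have to guess the recipient's own vocabulary to get through.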

SomeBody
Monday, November 24, 2003

The way you defeat random-word-insertion with Bayesian filters is by training on your own non-spam email. The filter will learn the difference between random vocabulary and YOUR non-spam email.
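A rough sketch of what that training step could look like, in Python. The formula follows my recollection of Paul Graham's "A Plan for Spam" (non-spam counts doubled, results clamped to 0.01-0.99, rare tokens skipped), so treat those details as assumptions. The function name train and the toy messages are invented for the example.

```python
from collections import Counter


def train(spam_msgs, ham_msgs, min_count=5):
    """Return {token: estimated P(spam | token)} from tokenised messages.

    spam_msgs and ham_msgs are non-empty lists of token lists.
    """
    if not spam_msgs or not ham_msgs:
        raise ValueError("need at least one spam and one non-spam message")
    bad = Counter(t for msg in spam_msgs for t in msg)
    good = Counter(t for msg in ham_msgs for t in msg)
    nbad, ngood = len(spam_msgs), len(ham_msgs)

    probs = {}
    for tok in set(bad) | set(good):
        b = bad[tok]
        g = 2 * good[tok]  # non-spam counts doubled to bias against false positives
        if b + g < min_count:
            continue       # too rare to say anything about
        spam_freq = min(1.0, b / nbad)
        ham_freq = min(1.0, g / ngood)
        p = spam_freq / (spam_freq + ham_freq)
        probs[tok] = max(0.01, min(0.99, p))  # never fully certain either way
    return probs


# Toy corpora: "the" appears everywhere, the other words only on one side.
spam = [["cheap", "pills", "the"]] * 6
ham = [["standup", "notes", "the"]] * 6
probs = train(spam, ham)
```

In this toy run, words that only appear in your own mail end up at 0.01, spam-only words at 0.99, and words common to both sit at 0.5. Padding words the spammer picks will mostly land in the neutral or spam-only group unless they happen to match your own vocabulary.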

Personally I've found Bayesian filtering to be good but not nearly perfect. "really good" spam these days looks like this: "hey check out this link: http://whatever no more: unsubscribe". Honestly that's pretty hard to tell from a legit email...

IMHO no legislation will ever solve the spam problem. The only solutions are ubiquitous, automatically-enabled-for-Grandma, very strong spam filters (with counter-attack measures), or an entirely new email infrastructure that raises the cost of transmission for untrusted senders.

Dan Maas
Monday, November 24, 2003

I brought this up a few weeks ago.  The problem is not the randomness of the text; these are entirely reasonable sentences that overload the statistical average of the whole message.  The messages are HTML for the most part, and the wads of text are invisible in the better examples.

That said, they are now being filtered reasonably by my setup, though the cost is the occasional literary email spinning towards the penile erectile dysfunction mails and those promising eternal life or wealth from the family of a deposed African statesman.

Simon Lucy
Monday, November 24, 2003

Still I wonder why anti-spam software developers do not concentrate on the vital part of the spam - the external hyperlink reference.

Sure spammers will copy/grab random texts from the web (Shakespeare, Google news) to fool textfilters, but they still can't sell you anything right in your email client so they need to lure you to their shops.

They can thwart your filter, but they cannot hide their shops on credible sites, so this is their weakest point.

No matter how sophisticated the message is, if it contains a link to http://some.site.ru/maria288r2/optout it sure is spam.
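Something like the following Python sketch is what I have in mind: ignore the body text and look only at where the links point. The function names and the bad_domains blocklist are hypothetical, and building and maintaining that list is the hard part this sketch leaves out.

```python
import re
from urllib.parse import urlsplit

# Deliberately simple URL matcher; real mail would also need HTML parsing.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def linked_hosts(body):
    """Return the set of host names that the message body links to."""
    hosts = set()
    for url in URL_RE.findall(body):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # malformed URL, skip it
        if host:
            hosts.add(host)
    return hosts


def links_to_known_spam_site(body, bad_domains):
    """True if any linked host is a listed domain or a subdomain of one."""
    for host in linked_hosts(body):
        if any(host == d or host.endswith("." + d) for d in bad_domains):
            return True
    return False


bad_domains = {"site.ru"}  # hypothetical blocklist entry
message = "To be, or not to be... http://some.site.ru/maria288r2/optout"
is_spam = links_to_known_spam_site(message, bad_domains)
```

No amount of Shakespeare in the body changes the result, because the text is never looked at.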

Johnny Bravo
Monday, November 24, 2003

Why do you people get spam?  I run my own mail server and never get spam, with two exceptions:

1) The ubiquitous "Nigerian" scam (got this once)

2) An Ebay scam, trying to entice me to a web site to enter my login credentials (only immediately after I sold something - I have an old Ebay login that equates to my e-mail address).

Why the need for all the complexity? 

Of course, I ignore mail sent to webmaster, postmaster, etc.

Mark M.
Monday, November 24, 2003

Are you kidding?

Slashbot
Monday, November 24, 2003

A recent study found that something like 60-70% of spam comes via email addresses that are available on the web. (and luckily the persistence isn't long, so if you remove or obfuscate your address now, spam subsides quickly)

I just updated my Bayesian filter with a new batch of 500 spam messages. Its accuracy went through the roof! It's already nailed 50 spam in the last two days, with zero errors. All of them had 90-99% spam probabilities. (before I was getting 2-3 false negatives per day). So, keep training your Bayesian filters, it pays off...

Dan Maas
Monday, November 24, 2003

Paul Graham recommends that spam filters spider all URLs contained in spam. If everyone did this, sending spam would bring an immediate and massive denial-of-service attack on yourself... (http://www.paulgraham.com)

Dan Maas
Monday, November 24, 2003

My feeling is that whatever law they pass must somehow generate enough revenue by itself to fund more anti-spam witch hunts. Spammers aren't really that rich; they just cause a lot of damage, and unless the rich ISPs or the average Joes getting hit by spam are convinced to pay a protection fee--in the billions--nothing will change. I can't think of any policing body interested in stopping a criminal if it can't get paid to do it. I think pay-as-you-go emailing may inevitably do just that. Many North Americans pay for private security--paying $20 a month for an anti-spam head-hunter is going to happen in the near future.

Li-fan Chen
Monday, November 24, 2003

Spammers aren't that rich? Obviously you never read the interview with the King of Spam where he points to the wing of his mansion that was paid for by such and such a campaign.

www.MarkTAW.com
Tuesday, November 25, 2003

>> No matter how sophisticated the message is, if it contains a link to http://some.site.ru/maria288r2/optout it sure is spam.

^^^^^ Absolutely brilliant.

On a different note -- why not move to E-Mail 2.0 with mandatory *encryption*? That would increase the cost per transmission to the point where mass mailing in the millions becomes impossible.

Alex
Tuesday, November 25, 2003

Now what do we do with the Instant Message spam I've been getting?

Session Start (12553037:228821721): Mon Nov 24 18:13:34 2003
[18:13:34] 228821721: Hi       
[18:13:39] Me: hi
[18:13:39] 228821721: brb, im on my cam on my homepage http://girlsname.somedomain.com
Session Close (228821721): Mon Nov 24 18:15:33 2003

www.MarkTAW.com
Tuesday, November 25, 2003

"Paul Graham recommends that spam filters spider all URLs contained in spam. If everyone did this, sending spam would bring an immediate and massive denial-of-service attack on yourself... "

Great. So apart from having to deal with barrages of fake bounces to random email addresses at my domain, and masses of spam from ill-configured anti-virus/anti-spam setups assuming the return address on a spam/virus mail is truthful, I will now have my website DDOS'ed because some fools assume every link in a spam goes only to the source.

" Is you man on the Patch? http://discuss.fogcreek.com/joelonsoftware/ "

Really clever guys :-(

Just me (Sir to you)
Tuesday, November 25, 2003

"I will now have my website DDOS'ed because some fools assume every link in a spam goes only to the source."

It would also take very little time for the script kiddies out there to start using this as a weapon to DDOS their favorite target.  They've already got plenty of zombie PCs under their control.  All they would need to do is send out lots of spam from these PCs with the target's web address in the spam.  Even worse, they could unleash a virus that does the spamming for them.

Matt Latourette
Tuesday, November 25, 2003

Added to which, if they did something like

http://some.hosting.com/users/spammer/confirm-address/you%40yoursite.com/ ...

You've just confirmed that you have a real address, and are contributing to a DDOS on a potentially innocent hosting company.
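A spidering filter could at least refuse to fetch links that carry the recipient's address. Here is a small Python sketch of that check. The function name is made up, and it only catches the plain or percent-encoded case shown above; an opaque per-recipient token in the URL would sail straight past it.

```python
from urllib.parse import unquote


def url_identifies_recipient(url, recipient):
    """True if the recipient's address appears in the URL, plain or percent-encoded."""
    decoded = unquote(url).lower()
    return recipient.lower() in decoded


url = "http://some.hosting.com/users/spammer/confirm-address/you%40yoursite.com/"
leaks_address = url_identifies_recipient(url, "you@yoursite.com")
```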

Steve P
Tuesday, November 25, 2003
