Fog Creek Software
Discussion Board

Caution on Bayesian Filters

They can reject valid email very easily.  You can lose email  from customers.

I emailed a president of a local (technology provider) company a few days ago and he never got the message. When I called him he replied that their antispam measures trapped my message. (I sent the message to his address as printed on his "bidness" card).

I gave him my address over the phone and he "whitelisted" my address to get the message. When he replied back to my original message, the subject reply line contained "Bayesian filter detected spam".

FYI. I was considering trying a Bayesian filter on my own accounts but this experience says otherwise.

Bored Bystander
Monday, December 15, 2003

Yeah, that's why I wouldn't use one unless I'm drowning in mail - the cost of one false positive is way too high.

Jason McCullough
Monday, December 15, 2003

Well, Bayesian filters do not promise that there will be no false positives.  You have to show (teach) the tool what the false positive is, and it will learn from it and adjust its filter next time -- whitelisting being one of the easiest solutions.
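To make the "teach it and it adjusts" point concrete, here is a minimal sketch of a word-count filter that is retrained on a false positive. The class name `TinyFilter`, the tokenizer, and the add-one smoothing are all assumptions chosen for illustration; this is not the algorithm of any particular product.

```python
import math
import re
from collections import Counter


class TinyFilter:
    """Toy naive Bayes spam filter (illustrative only, hypothetical design)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        # Each distinct lowercase word counts once per message.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def train(self, text, label):
        """Record one message under label 'spam' or 'ham'."""
        self.messages[label] += 1
        self.counts[label].update(self.tokens(text))

    def score(self, text):
        """Return an estimated probability that text is spam."""
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for tok in self.tokens(text):
            p_spam = (self.counts["spam"][tok] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


f = TinyFilter()
f.train("free offer click here now", "spam")
f.train("limited offer click this link", "spam")
f.train("meeting notes attached for monday", "ham")

msg = "please click this link for our quote"
before = f.score(msg)   # leans towards spam with so little ham seen
f.train(msg, "ham")     # the user marks it "not spam"
after = f.score(msg)    # the same message now scores much lower
```

The point of the sketch is only that a correction changes the word statistics, so similar messages from the same sender are less likely to be trapped next time.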

Code Monkey
Monday, December 15, 2003

I use one. It works great. I get tons of spam (I've had this email address for 10 years or so), so I had a lot of spam to train it on. I set up a spam folder and have Outlook delete anything in it older than a week.

Every couple of days, I take a quick peek through it to make sure that nothing important gets dropped. The filter I'm using differentiates between possible spam and spam. You can set the probability anywhere you like, but I have it set up so anything with >0.3 is possible spam and >0.9 is spam.
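The two cutoffs amount to a three-way split. A minimal sketch, assuming the filter hands back a spam probability between 0 and 1 (the function name and folder labels are made up for illustration):

```python
def classify(spam_probability, possible_cutoff=0.3, spam_cutoff=0.9):
    """Map a spam probability to one of three folders.

    The cutoffs mirror the 0.3 / 0.9 settings mentioned in the post;
    the folder names are hypothetical.
    """
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must be between 0 and 1")
    if spam_probability > spam_cutoff:
        return "spam"
    if spam_probability > possible_cutoff:
        return "possible spam"
    return "inbox"
```

Raising `spam_cutoff` trades fewer false positives for more mail landing in the "possible spam" folder that has to be skimmed by hand.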

So far, I get stuff miscategorized as possible spam every few days, but nothing has ever shown up as spam that's not. Every couple of days I get a real spam email in my inbox.

The stuff that is miscategorized is typically short emails with URLs in them, such as a friend sending an email saying "check out this site: <link>".

Monday, December 15, 2003

There's also a *lot* of difference between the algorithms used by so-called "Bayesian" filters that can make a 10-to-1 or greater difference in how many false negatives or false positives they get.

Spambayes is one of the few "bayesian" filters that has had extremely rigorous testing.  Every time somebody posts a new idea for what could be added, it gets put through quite a bit of testing on different people's databases, using a cross-verification procedure.  The cross-verification re-trains the database on various small subsets of the corpus, then scores the remainder of the corpus.  This ensures that the test isn't "cheating" by scoring messages that have already been seen by the scoring database.
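The procedure described here (more commonly called cross-validation) can be sketched in a few lines. This is a generic illustration under stated assumptions, not the Spambayes test harness: the corpus is assumed to be a list of `(text, label)` pairs, and `train_fn` / `predict_fn` are hypothetical callables standing in for whatever filter is being tested.

```python
def subset_splits(corpus, k):
    """Yield (train_subset, remainder) pairs.

    The corpus is cut into k subsets. Each subset is used once for
    training, and everything outside it is held back for scoring, so no
    message is ever scored by a database that was trained on it.
    """
    folds = [corpus[i::k] for i in range(k)]
    for i, train_subset in enumerate(folds):
        remainder = [msg for j, fold in enumerate(folds) if j != i for msg in fold]
        yield train_subset, remainder


def error_rate(corpus, k, train_fn, predict_fn):
    """Fraction of held-out messages that were misclassified."""
    errors = total = 0
    for train_subset, remainder in subset_splits(corpus, k):
        model = train_fn(train_subset)          # fresh database each round
        for text, label in remainder:
            total += 1
            errors += predict_fn(model, text) != label
    return errors / total
```

Swapping the two roles (train on the remainder, score the small subset) gives the more usual k-fold arrangement; either way the key property is that training and scoring sets never overlap.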

AFAIK, very few (if any) of the learning filters out there have been as rigorously tested as Spambayes to ensure that the tokenization and scoring methods are sound.  Most filters also just use the original Paul Graham tokenizer and scoring hacks, while Spambayes uses an algorithm specifically designed to produce relatively "flat" scores, so that it can rate mail as "ham", "spam", and "unsure".  For all practical purposes, the Graham algorithm doesn't *have* an "unsure", so if it sees something that it has never seen before, the classification will be almost random.  A lot of the early development of Spambayes was focused on getting the filter to "know that it doesn't know" how to classify an e-mail.
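A rough way to see the difference is to compare two ways of combining per-word spam probabilities. The sketch below is an illustration under assumptions, not code from either project: `graham_score` uses the combining formula from "A Plan for Spam" but leaves out its token selection and clamping, and `chi_squared_score` follows the chi-squared combining idea associated with Spambayes without claiming to match its implementation.

```python
import math


def graham_score(probs):
    """Graham-style combining: prod(p) / (prod(p) + prod(1 - p))."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def chi2q(x2, df):
    """Probability that a chi-squared variable with even df exceeds x2."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_squared_score(probs):
    """Combine spam and ham evidence separately, then average them.

    Each probability must be strictly between 0 and 1. Strong evidence
    on both sides pulls the result towards 0.5, i.e. "unsure".
    """
    n = len(probs)
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0


mixed = [0.99, 0.99, 0.99, 0.01, 0.01]   # three spammy words, two hammy words
g = graham_score(mixed)       # close to 1: confidently "spam"
c = chi_squared_score(mixed)  # close to 0.5: conflicting evidence
```

With conflicting clues the first formula still gives a confident verdict, while the second lands in the middle, which is what leaves room for an "unsure" band between the ham and spam cutoffs.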

Phillip J. Eby
Monday, December 15, 2003

Nice post, Phillip. I found it very informative.

Exception guy
Monday, December 15, 2003

The caution should apply to any sort of filtering, not only the "Bayesian" type. Any processing system that replaces intelligence with total reliance on a set of rules is prone to screw-ups.

Monday, December 15, 2003

You'll notice the spammers are trying to poison or dilute the filters by including chunks of ordinary text, the theory being that 'normal' words will eventually end up being tagged as spam words, your filter starts stripping real emails, you dump the filter, spam gets a free rein, everyone's happy.

Except they use really odd text, not "How are you, weather is nice", but bits of old speeches, historical dates, etc. that most people wouldn't use.

Of course, the message is HTML, with the text some shade of white, and there's a web-bug, so not all is lost.

Monday, December 15, 2003

Does anyone know if Mozilla Mail (as part of Mozilla 1.5) uses Bayesian filtering? I started using it yesterday and it marks a lot of email as spam. There is a "Not Spam" button, which makes me think maybe this is part of the Bayesian training.

Daniel Searson
Monday, December 15, 2003

Thunderbird has Bayesian filtering. I always check my junk mail though, it's bolded so it's easy to find in my junk folder. I just click the little dot that marks it as read if it is spam, and read it if it might not be spam.

I would never trust any spam filtering to be 100% accurate.

Monday, December 15, 2003

I don't expect that any SPAM filter will ever be perfect.

As long as I can get mine to be accurate within a few percent of perfection, though, it is still FAR easier than sifting through everything manually.

I find that with a quick glance through the quarantined folder, the occasional false positives show up as plainly as a fresh dent in a shiny new sportscar.

That's why it seems like a better idea to filter SPAM at the client instead of on the server...

Tim Lara
Tuesday, December 16, 2003

I was going to switch to one -- for one thing my "brain filter" is not 100% reliable either!  I have deleted real messages, thinking they were spam with a quick glance.  It's hard not to when you get so many messages every day.  Hopefully that spam legislation will actually do something.

Tuesday, December 16, 2003

Eudora's built-in learning filter does produce a few false positives every now and then, but they're too rare to bother about. Looking through the junk folder isn't that hard, and it's still easier and faster than marking all the spam mail myself.

Chris Nahr
Tuesday, December 16, 2003

Daniel S,

Yes, Mozilla Mail has done Bayesian filtering for a while. However, something about its implementation meant that it took me maybe twice as long to train it as my previous solution (bogofilter on Unix).

However, after it's been running for a while it's damned impressive.

Michael Koziarski
Tuesday, December 16, 2003

Singling out Bayesian filters because someone's filter misclassified one message doesn't make sense. No spam filter is ever going to be perfect.  But some of the Bayesians are pretty darn good. 

Mine is currently 99.24% accurate, and it's classifying mail into eight categories (personal, mailing lists, one for each major client, etc.), not just "spam" vs. "not spam."  I can't recall it ever categorizing something as spam that wasn't, but anyone using ANY spam filter should periodically check the spam folder for false positives.

Chris Dunford
Tuesday, December 16, 2003

The point about Bayesian filtering is that you are supposed to keep training it.

I've started using SpamBayes and am highly impressed. Sometimes it detects mail as possible spam or even spam when it isn't, but after I restore it, it doesn't make the same mistake again.

It is still easier to have a quick glance at the junk folder than to try and sift all from the inbox manually.

If you are really worried about losing a mail, then open and read everything. But you'll end up spending a couple of hours a day.

Stephen Jones
Tuesday, December 16, 2003

I've been using spambayes for quite a while now and haven't had a false positive for months. (I still check though). Got a few early on, then the occasional one, then they really became rare....

The problem with them is that they're only as good as the person 'teaching' it / checking it. If a user just turns it on and assumes it will 'just work', it won't be very effective / accurate.

Bored Bystander - I wouldn't discount the whole concept of Bayesian spam filtering based on one experience.

I find it invaluable. It gets things right 99% of the time, and takes minimal time to identify and deal with the other 1%.

Of course, if you only get a minimal amount of spam, it's probably not worth the effort. (I made the mistake of posting to usenet with a valid email address).

Gordon Hartley
Tuesday, December 16, 2003

> I find it invaluable. It gets things right 99% of the time, and takes minimal time to identify and deal with the other 1%.

The problem is not the 1% incorrect classification rate but whether the incorrectly classified email is a false positive or a false negative. Allowing a small percentage of spam through the filter is *much* easier to deal with than incorrectly classifying "valid" email as spam.
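The asymmetry can be put in numbers with a small expected-cost sketch. The cost weights below are purely hypothetical; the only point is that the cutoff for filing a message as spam rises as a lost legitimate email becomes more expensive relative to a spam that slips through.

```python
def spam_cutoff(cost_false_positive, cost_false_negative):
    """Probability above which filing a message as spam is the cheaper bet.

    Filing as spam risks (1 - p) * cost_false_positive; delivering risks
    p * cost_false_negative. The two are equal at the value returned here.
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def total_cost(false_positives, false_negatives,
               cost_false_positive, cost_false_negative):
    """Weighted cost of a filter's mistakes over some batch of mail."""
    return (false_positives * cost_false_positive
            + false_negatives * cost_false_negative)


# Two hypothetical filters, both "99% accurate" on 1000 messages:
cost_a = total_cost(10, 0, cost_false_positive=99, cost_false_negative=1)
cost_b = total_cost(0, 10, cost_false_positive=99, cost_false_negative=1)
```

With a lost customer email weighted 99 times worse than a delivered spam, the cutoff comes out at 0.99, and the filter whose errors are all false positives is far more costly than the one whose errors are all false negatives, even though both have the same headline accuracy.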

Fortunately, most filters assign this weight appropriately but as BoredBystander mentioned, mistakes happen.

As an aside, isn't it bad design on Yahoo Mail to place the "This is Spam" notification button in the "view mail" screen rather than the "list of email" screen? Many spam emails are instantly identifiable via the subject line.

Tuesday, December 16, 2003

Now if Bored had got the message "your email has been bounced by a computer" he wouldn't be saying he didn't trust computers for communications and was going back to carrier pigeon.

The point about Bayesian filters is that they need training, and thus are basically as effective as the user. If the user doesn't train them properly they won't work.

Stephen Jones
Tuesday, December 16, 2003

I have to (once again) give my props to SpamBayes.  I get about 200 spam a day and have now been using SpamBayes for a couple of months.  I don't think I've had a single false positive.  I do get legit email in the "Suspects" category every now and then and the rare email does get through to my Inbox.  But, and this is EXTREMELY critical for any spam filter, I simply don't get false positives.

SpamBayes' algorithm is very tight.  If someone got a false positive, it's because they aren't using as good an algorithm, or they miscategorized some emails early on.

Tuesday, December 16, 2003

Something that I haven't noticed being mentioned is where the filter is running. If the filter is running on the client, then it is easy to deal with a misdiagnosis. On the other hand, filters running on the company's email server can be a major hassle. I've had several emails disappear into the ether that way.

Tuesday, December 16, 2003

I've used two Bayesian filters: one I wrote myself, and Spambayes. I don't think either of them has ever given me a false positive, after many months of use starting soon after Graham's original "A Plan for Spam" article. Both have nice low false-negative rates too. I've found that a good Bayesian filter is superior to my brain in both false-positive rate and false-negative rate, unless I give my brain more time per message than I am willing to invest. I do check everything in my probable-spam folder (very superficially), to reduce the expected false-positive rate further, but as I say this hasn't been an issue yet.

Gareth McCaughan
Wednesday, December 17, 2003

I have noticed that spammers are now trying to poison the Bayesian filters by including hundreds of random words like {turpentine intelligent westonia hologram organic} in an attempt to confuse the filter.

They do not use connector words like {a, an, and, or, the}.

This is the dead give-away that it is a spam message.
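As a rough illustration of that give-away, the share of connector words in a message can be measured directly. This is only a sketch: the function names and the `min_words` / `min_ratio` thresholds are hypothetical values picked for the example, not tuned on real mail.

```python
import re

# The connector words listed in the post.
CONNECTORS = {"a", "an", "and", "or", "the"}


def connector_ratio(text):
    """Fraction of the words in text that are connector words."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(w in CONNECTORS for w in words) / len(words)


def looks_like_word_salad(text, min_words=20, min_ratio=0.02):
    """Flag long runs of text that contain almost no connector words."""
    words = re.findall(r"[a-z]+", text.lower())
    return len(words) >= min_words and connector_ratio(text) < min_ratio
```

A check like this would be easy for spammers to defeat by sprinkling in connector words, so it is better thought of as one extra clue for the filter than as a test on its own.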

Also the spam words {Wi-n, FR-E, V-gra} are in GIF format.

The random words are at the bottom of the spam message and there are literally hundreds of them.

Also at the very bottom are random characters like

I think that we need to create "Smart Bayesian Filters" that can detect poison attempts.

Tom Paris
Saturday, February 21, 2004
