Fog Creek Software
Discussion Board




Step #231 in the spam escalation wars


I use a Bayesian filter for all my email to tag spam and divert it from my usual attention and on the whole it is working well.

In the last few days I've noticed a type of email getting through that may be difficult to filter.

It's a multipart MIME mail and the plain-text alternative is:

"never as yet published in full, only abstracted in the Origin.
despotic elements retained by the conquered nations as yet only"

The HTML component, which I'd never normally see as I only read plain text mail, had an advert for selling drugs online and at the bottom was

"when occupying Yang-p`ing and about to be attacked by Ssu-ma I, I could hardly repress a shuddering recoil as he came, bending amiably, these differences implied in itself a political classification. A "

So it looks like they're trying to poison the statistical filter by increasing the use of contextually irrelevant but general words so that the statistical score is likely to be similar to acceptable email.

This one fails to a degree because the subject line is mangled and obviously spam:

"Che_ck ou't ou-r se,lection (of gre=at R"X mp_xsdjjd"
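A mangled subject like that one can be caught with a very crude check. The following is only a sketch under my own assumptions: the function name and the "letter, punctuation, letter" rule are invented for illustration, and ordinary words such as "don't" or "co-op" also trip it, so any cut-off would need to be fairly high.

```python
import re

# Hypothetical rule: a letter, then a character that is neither alphanumeric
# nor whitespace, then another letter, e.g. "ou't", "se,lection", "gre=at".
_INNER_PUNCT = re.compile(r"[A-Za-z][^A-Za-z0-9\s][A-Za-z]")


def mangled_fraction(subject):
    """Fraction of whitespace-separated tokens with punctuation between letters."""
    tokens = subject.split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if _INNER_PUNCT.search(token))
    return hits / len(tokens)


SPAM_SUBJECT = "Che_ck ou't ou-r se,lection (of gre=at R\"X mp_xsdjjd"
```

Seven of the eight tokens in the subject quoted above match the rule; only "(of" escapes, because its punctuation is not between two letters.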

So, now I'm thinking of counter methods, since others will generate mail using that other filter avoidance technique: the entirely irrelevant but superficially reasonable subject line.

I still don't buy the 'boil-the-ocean' solution of replacing SMTP, since it's not the delivery method but the content that's poisoned.

Instead, I suppose the same model as that used in the Cold War will be applied: small incremental improvements on both sides to counter the previous improvement, falling back occasionally to more primitive methods.

I can certainly look at tailoring the filtering so that it treats the different components separately, scoring both the plain text and the HTML and choosing a particular bucket when the scores differ.  Or applying a vocabulary checker to see if the same words are used in both components.

The latter would work for me since if the plain text just says read the HTML I'll ignore it anyway.
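The vocabulary check could look something like the sketch below. It is not taken from any existing filter: the function names, the three-letter minimum, and the use of Jaccard overlap are all assumptions made for illustration, and the decoding is simplified to UTF-8.

```python
import re
from email import message_from_string
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def words(text):
    """Lower-cased set of alphabetic words of three or more letters."""
    return set(re.findall(r"[a-z]{3,}", text.lower()))


def part_vocabularies(raw_message):
    """Return (plain_words, html_words) for a MIME message."""
    msg = message_from_string(raw_message)
    plain, html = set(), set()
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # Simplification: treat every part as UTF-8 and replace bad bytes.
        text = payload.decode("utf-8", errors="replace")
        if ctype == "text/plain":
            plain |= words(text)
        else:
            extractor = _TextExtractor()
            extractor.feed(text)
            extractor.close()
            html |= words(" ".join(extractor.chunks))
    return plain, html


def vocabulary_overlap(a, b):
    """Jaccard overlap: shared words divided by all words seen in either part."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


SAMPLE = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "despotic elements retained by the conquered nations\n"
    "--XX\n"
    "Content-Type: text/html\n"
    "\n"
    "<html><body><p>Cheap pills online</p></body></html>\n"
    "--XX--\n"
)
```

A genuine multipart/alternative mail says roughly the same thing in both parts, so its overlap should be high. For the made-up SAMPLE above, the two parts share no words at all, which is the case the bucket choice would have to handle.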

Simon Lucy
Saturday, October 11, 2003

One word: SpamAssassin.

Brad Wilson (dotnetguy.techieswithcats.com)
Saturday, October 11, 2003

Oh, okay, that was just dramatic. :) If you're a Windows user, and your e-mail comes in via POP3, then SAproxy is what you want. Fully enclosed copy of SpamAssassin that masquerades as a POP3 proxy. I was using it while my mail host was fixing their broken SpamAssassin install. Brilliant, it is. :)

Brad Wilson (dotnetguy.techieswithcats.com)
Saturday, October 11, 2003

Well yes, though it's pretty much equivalent to what I already use, combining Bayesian filters and header analysis.  So it's going to have to handle exactly the same thing.

Simon Lucy
Saturday, October 11, 2003

My ISP offers SpamAssassin, and it *is* very good.  But the nonsense-text spams mentioned above have been leaking through to an alarming degree over the last week or two.

Hardware Guy
Saturday, October 11, 2003

By the way, has everyone else's (in the US) phone stopped ringing?
I don't think I've heard from a telemarketer all week.

[pleasant sigh]

Philo

Philo
Saturday, October 11, 2003

It got to the stage at work where I would physically disconnect the phone because it was ringing so much. It was only fax machines as well *sigh*. Recently it got better, so now I just pick up, it bleeps, and then I lay the handset on the desk. It must cost idiots who can't work fax machines a shed load of money.


Sunday, October 12, 2003

This has been about for a while on usenet, Simon. Step 232 is something which follows that thing about humans being able to read words even when they are misspelt, as long as the first and last letters are correct. Not that it helps when I spamcop them, but I'm sure it makes them feel better, even if it makes them appear illiterate.
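One possible counter to that kind of scrambling, sketched here purely as an assumption (the function name is made up and no filter is known to do exactly this), is to reduce each word to a key that ignores the order of its interior letters before it reaches the statistical filter:

```python
def scramble_key(word):
    """Keep the first and last letters in place and sort everything between."""
    w = word.lower()
    if len(w) <= 3:
        # Words this short have at most one interior letter; nothing to reorder.
        return w
    return w[0] + "".join(sorted(w[1:-1])) + w[-1]
```

With this, "mdeicatoin" and "medication" get the same key. The cost is that some unrelated words collide as well ("salt" and "slat", for instance), so it trades a little precision for robustness.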


Sunday, October 12, 2003

"The HTML component"

Your scanner continued after this point? I kill on the merest sign of HTML in an email.

Maybe I'm just strange.

Bill Godfrey
Sunday, October 12, 2003

Maybe you don't communicate with very many people.  Lots of normal people, i.e. those who aren't programmers, use HTML email as a matter of course...

Chris Nahr
Sunday, October 12, 2003

I don't think Bill's strange, but maybe automatic deletion is a bit extreme. I filter all HTML mail into a separate mailbox, which I find makes it easier to spot the very rare good messages. It's true that a lot of people do send HTML mails, but often because it's turned on by default, not because they use the formatting features, and everyone I know has been quite happy to turn HTML off when asked.

as
Sunday, October 12, 2003

I filter out all html mail as well.


Monday, October 13, 2003

Well my mail client helpfully adds a text component to HTML mail only to say 'there's nothing here' so that doesn't bother me.

Simon Lucy
Monday, October 13, 2003

There's a chance (probably a darn good one) that a Bayesian filter is not the end-all solution. But technically speaking, poisoning has to be done intelligently to get much bang for the effort. From what Graham explains, it's clear that random words that have no relevance in one's world (what one communicates about over email) will not score high. But if, say, an email address harvester can infer whether an email address holder loves Barbie or enjoys surfing, that's the key to poisoning the statistical outcome in a way that might get you somewhere.

So don't leave your email address on your favorite <enter actress or interest here> web guestbooks; it can't be good in the long run.
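The difference between random and targeted poisoning can be shown with a small sketch. It assumes the Graham in question is Paul Graham's "A Plan for Spam", and borrows from that essay the 0.4 score for never-seen words, the 15 most interesting tokens, and the 0.9 spam threshold. The per-word probabilities in the table are invented for illustration, not measured from any real mailbox.

```python
from math import prod

UNKNOWN = 0.4   # score given to a word the filter has never seen
TOP_N = 15      # only the tokens farthest from a neutral 0.5 are combined


def spam_probability(tokens, table):
    """Combine per-word spam probabilities over the most interesting tokens."""
    probs = [table.get(token, UNKNOWN) for token in set(tokens)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    interesting = probs[:TOP_N]
    spam = prod(interesting)
    ham = prod(1.0 - p for p in interesting)
    return spam / (spam + ham)


# Made-up probabilities for one recipient's mail.
TABLE = {
    "viagra": 0.99, "pharmacy": 0.98, "online": 0.90,
    "meeting": 0.05, "compiler": 0.02, "deadline": 0.03,
}

advert = ["viagra", "pharmacy", "online"]
random_padding = [f"word{i}" for i in range(20)]   # words the filter never saw
targeted_padding = ["meeting", "compiler", "deadline"]

base = spam_probability(advert, TABLE)
with_random = spam_probability(advert + random_padding, TABLE)
with_targeted = spam_probability(advert + targeted_padding, TABLE)
```

With these numbers, twenty unknown words barely move the score and the mail stays above 0.9, while three words the recipient actually uses in legitimate mail pull it well below. That only works if the sender knows what the recipient writes about.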

Li-fan Chen
Tuesday, October 14, 2003
