Fog Creek Software
Discussion Board




Joel, Seek of Spam? ashkiuyekjha ? :-)

The question was posed on OpenIT why there always seem to be random words in the subject lines of spam messages.

I don't buy that these words make the spam more likely to get past filters. I don't see why that would make a difference; in fact, it would appear to make spam easier to detect.

Many spammers are small independent contractors. What I think may be happening is that these sequences are a sort of hash key used to determine whether the campaign was conducted satisfactorily.

The parties paying for or brokering each spam campaign would have sampling sites mixed into the email addresses. The spam containing the unique keys would need to be received at the majority of these sites in order for the campaign to be considered payable.

Of course, maybe spammers aren't that smart, and the same instinct that makes them use heavily accented subject lines :-) also makes them insert crap into spam for no good reason.

So, this is a different topic for JOS - the design of something obnoxious that everyone hates ;-) - but I thought it would be interesting to debate.

Goddard Bolt
Thursday, August 28, 2003

Actually, there's another reason for the garbage that you may not be aware of.

Administrators of high-volume mail systems used to set up machinery to detect that lots of identical emails were being sent through their servers.  (This was like 5-6 years ago.)  Spamming software then quickly evolved to add random junk in the subject line and/or body, in order to foil these detectors.

Phillip J. Eby
Thursday, August 28, 2003

"I don't buy that these words make the spam more likely to get past filters."

I'd agree with this comment, but I personally think that it's designed not for professional filters, but for Outlook Rules that "regular" people use to filter spam, instead of purchasing a package to do it for them.

There is a maximum # of rules you can have in Outlook, so a user without a professional filter can only block a limited # of words.

"...these sequences are a sort of hash key used to determine if the campaign was conducted satisfactorily."

Couldn't spammers just use different URLs to achieve the same thing?  In the past I've received the same spam at the same time from three different "marketers", all 3 of which went to different URLs in the same domain.

Jeff MacDonald
Thursday, August 28, 2003

>> Couldn't spammers just use different URLs to achieve the same thing?

This would test only the response rate to certain campaigns, and would not give any statistical assurance that whoever delivered the spam on contract actually sent it in the first place.

On the other hand, if (for instance) 95% of 20 or 50 email addresses used only as test points received emails containing a unique signature in a spam campaign, then it would indicate that the sender had delivered about 95% of the emails correctly. I would think that anyone paying for spamming services would want this kind of independent verification.
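To make the arithmetic concrete, here is a rough sketch of that kind of check. Everything in it is hypothetical: the function names, the idea of embedding a random hex key, and the 50% "majority" threshold are assumptions for illustration, not a description of how any real campaign is audited.

```python
import secrets


def make_campaign_key(nbytes: int = 6) -> str:
    """Return a random hex string to embed in every message of one campaign."""
    return secrets.token_hex(nbytes)


def delivery_rate(seed_inboxes: dict[str, list[str]], key: str) -> float:
    """Fraction of test-point inboxes holding at least one message with the key."""
    if not seed_inboxes:
        return 0.0
    hits = sum(
        1
        for messages in seed_inboxes.values()
        if any(key in message for message in messages)
    )
    return hits / len(seed_inboxes)


def is_payable(seed_inboxes: dict[str, list[str]], key: str,
               threshold: float = 0.5) -> bool:
    """Assumed rule: the key must reach more than `threshold` of the test points."""
    return delivery_rate(seed_inboxes, key) > threshold
```

With 20 test addresses, 19 of which received a message containing the key, delivery_rate comes out at 19/20, i.e. the 95% figure above.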

My guess is that the random looking words have a function related to this. But again, I may be totally wrong.

Goddard Bolt
Thursday, August 28, 2003

My take is that it's most likely an attempt to defeat super-lame filtering. When people spam USENET, they use the same strategy.

Since it adds basically nothing to the work required to spam, and it at least slightly increases the chance the mail will get through, why not use it?

The latest annoyance I've seen is emails that use capital letters on a field of smaller characters to spell out a message:
..G.E.T...H.E.R.B.A.L..V.I.A.G.R.A..H.E.R.E..
Some of these are essentially unreadable, though when they're sent as HTML mail, with the '.'s in light grey, it actually works pretty well.

-Mark

Mark Bessey
Thursday, August 28, 2003

I think that the images in the HTML spam are enough to track back.

I'm personally backing the theory that they make it harder to spot spam by looking for patterns in an entire feed and also to foil simple keyword blocking.  It's really really easy and fast to keep a rolling buffer of the last few thousand messages that have come in and store the MD5 or CRC32 digest of said messages and then flag identical messages for advanced processing.
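A minimal sketch of that rolling-buffer idea, assuming a window of a few thousand messages and a made-up threshold of three identical copies (the class name and both parameters are mine, purely for illustration):

```python
import hashlib
from collections import Counter, deque


class DuplicateDetector:
    """Remember the digests of the last `window` messages and flag repeats."""

    def __init__(self, window: int = 5000, threshold: int = 3):
        self.window = window
        self.threshold = threshold
        self.recent = deque()      # digests in arrival order
        self.counts = Counter()    # digest -> occurrences inside the window

    def seen_too_often(self, body: str) -> bool:
        """Record one message; return True if it should get advanced processing."""
        digest = hashlib.md5(body.encode("utf-8")).hexdigest()
        self.recent.append(digest)
        self.counts[digest] += 1
        # Evict the oldest digest once the buffer is over capacity.
        if len(self.recent) > self.window:
            old = self.recent.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]
        return self.counts[digest] >= self.threshold
```

Because the digest covers the whole body, appending even one random word to each copy gives every message a different digest, which is exactly why the junk defeats this kind of detector.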

The other thing is that a lot of admins end up writing a simple brute-force mail filter that will flag anything containing phrases like "Banned CD".

However, trying to get by the spam filters is a Faustian bargain. Somebody may send you an old fart joke that involves the word "Viagra", and that alone won't earn a high enough SpamAssassin score, but if the message contains "VViagra" or "v1agra", it's easier to tag.  If a message isn't base64 encoded and has a bunch of random gibberish in it, it's easier to tag.  The only way to get by them is to send spam that is completely conversational, but I don't know if you'd have advertising at the end.  I don't normally talk about enlarging certain organs with my friends, no?
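Here is a rough sketch of why the disguised spelling is the easier catch. This is not how SpamAssassin works; the substitution table, the one-word watchlist and the function names are all assumptions for illustration. The idea is that a token which only matches the watchlist after undoing the disguise is a much stronger signal than the plain word.

```python
import re
import string

# Assumed character substitutions; a real filter would need a longer table.
LEET = str.maketrans({"1": "i", "0": "o", "3": "e", "4": "a", "@": "a", "$": "s"})

# Hypothetical watchlist, stored in normalized form (no doubled letters).
WATCHLIST = {"viagra"}


def normalize(token: str) -> str:
    """Undo simple disguises: substitutions, separators, repeated letters."""
    t = token.lower().translate(LEET)
    t = re.sub(r"[^a-z]", "", t)      # drop dots, digits and other separators
    t = re.sub(r"(.)\1+", r"\1", t)   # collapse runs of the same letter
    return t


def obfuscated_hits(text: str) -> list[str]:
    """Return tokens that match the watchlist only after normalization."""
    hits = []
    for raw in text.split():
        plain = raw.lower().strip(string.punctuation)
        if normalize(raw) in WATCHLIST and plain not in WATCHLIST:
            hits.append(raw)
    return hits
```

The joke with plain "Viagra" produces no hits, while "VViagra", "v1agra" and "V.I.A.G.R.A" are all returned.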

Flamebait Sr.
Thursday, August 28, 2003

I have an idea that the random garbage increases the chance of the emails getting past Bayesian filters as well, and those are becoming more popular.

Bayesian filtering (as I understand it) depends on recognizing tokens and giving each token a probability of being spam.

If the tokens are not recognised then the email is initially tagged as 'not spam' until the user indicates that it is, at which point any emails with similar tokens have an increased chance of being spam as well.

...so ensuring that there are always unrecognizable tokens seems to me to be a reasonable method of bypassing that filtering method.

Eventually we will have to combine the Bayesian filtering with a spell checker as well.....

FullNameRequired
Thursday, August 28, 2003

Just a thought -- has anyone tried to do a Google search on this subject?  As we used to say in graduate school, "A month in the laboratory can often save you an hour in the library".

J. D. Trollinger
Thursday, August 28, 2003

The Bayesian filters I've seen have three levels of words - known spam, known good, unknown.  A few good words will out-balance a lot of spam words, but unknown words don't count for much.  Most filters only consider the top n spam/good words in creating a score anyway.  All this is specifically to stop 'noise' from unbalancing the spam/good words.
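A minimal sketch of that top-n scoring, not taken from any particular filter. The near-neutral 0.4 for unknown words, the default of 15 tokens, the clamping bounds and the function name are all assumed values for illustration.

```python
from math import prod

UNKNOWN_PROB = 0.4   # assumed: unseen tokens sit close to neutral (0.5)


def spam_score(tokens, token_probs, top_n=15):
    """Combine the top_n most 'interesting' token probabilities into one score.

    token_probs maps a known token to its probability of appearing in spam.
    """
    probs = [token_probs.get(t, UNKNOWN_PROB) for t in set(tokens)]
    # Clamp so a single token can never force a division by zero below.
    probs = [min(max(p, 0.01), 0.99) for p in probs]
    # "Interesting" means far from 0.5, in either direction.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:top_n]
    if not top:
        return 0.5
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)
```

Since unknown tokens are only 0.1 away from neutral, they sort last. As long as a message has at least top_n known words, any amount of added gibberish is cut off before the score is computed and changes nothing.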

i like i
Friday, August 29, 2003

If you're interested in these topics, the forums on the SourceForge page for POPFile are a good starting point - they have a separate forum on techniques for bypassing Bayesian filters.

Unsigner
Friday, August 29, 2003
