Second-Generation Anti-Spam Solutions

by Kaitlin Duck Sherwood
First revision September 9, 2002

It seems like a bumper crop of anti-spam software has sprouted overnight! This is good news for everyone who is frustrated by unsolicited commercial email.

Why Now?

Someone asked me why, given how long people had been bothered by spam, he should expect the new crop to be any better than the old crop of anti-spam solutions.

It's important to remember that spam has only been with us for eight years. While that may seem like forever in Internet time, it takes an extremely long time to figure out how to best use any novel communications technology. It took ten years to settle on "Hello" as the proper thing to say when answering the phone, for example. (Alexander Graham Bell argued for "Ahoy" instead.)

Furthermore, spam has gotten much, much worse in the past year, making anti-spam software start to look like a reasonable investment for businesses. A lot of bright people figured this out and started companies to sell to those businesses.

The sheer number of anti-spam entrants is a good sign: the competition will breed innovation. Just as important, businesses can't tolerate the thought of losing even one customer to the spam bucket, and that is forcing the anti-spam solutions to get more precise.

In the rest of this essay, I'll talk about the limitations of the first generation of anti-spam tools, the features of second-generation anti-spam tools, and then give a few recommendations for specific products.

First-Generation Spam Tools

The first generation pretty much all used simple pass/fail rules:

   if the phrase "over 18" is in the body, mark it spam
   if the phrase "teen" is in the body, mark it spam

Unfortunately, these rules were too crude to tell spam from legitimate mail. For example, with the above rules, a message containing the following text would get deleted, because "seventeen" contains "teen":

   Subject: please send me seventeen XJR37-Q grommets
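
As a minimal sketch of why this goes wrong, here is the kind of substring matching that such rules amount to, written in Python. The names BANNED_PHRASES and is_spam_first_gen are made up for illustration, and checking the subject and body together is an assumption about how such a tool might behave, not a description of any particular product.

```python
# Hypothetical first-generation filter: a single banned phrase fails the message.
BANNED_PHRASES = ["over 18", "teen"]


def is_spam_first_gen(subject: str, body: str) -> bool:
    """Return True if any banned phrase appears anywhere in the message."""
    text = (subject + "\n" + body).lower()
    return any(phrase in text for phrase in BANNED_PHRASES)


# The legitimate order is flagged because "seventeen" contains "teen".
print(is_spam_first_gen("please send me seventeen XJR37-Q grommets", ""))  # True
```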

"Whitelisting" -- allowing people to set up a list of "known good" addresses -- helped, but usually the whitelists didn't integrate well with address books. People had to maintain two sets of addresses, which was a pain. Furthermore, whitelisting didn't help when strangers (customers!) sent messages, or when someone changed their email address.

Second-Generation Spam Tool Features

The second generation uses a variety of different approaches -- and combinations thereof -- that are much more effective. The main techniques are:

Scoring.

Instead of a strict pass/fail, the filters look for many features in a message and add penalty points or subtract bonus points for each one, based on how often that feature shows up in spam. For example:

   if the phrase "over 18" is in the body, add 50 points 
   if the sender is in my address book, subtract 90 points
   if the phrase "teen" is in the body, add 10 points 
   if the phrase "penis" is in the subject line, add 80 points 
   if the body contains an image, add 70 points 
   ... 
   if the score is > 100 points, delete the message
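
The rules above translate almost directly into code. The following Python sketch uses the same point values and threshold. ADDRESS_BOOK, spam_score, and should_delete are hypothetical names, and the way a message is represented (sender, subject, body, and an image flag) is an assumption made to keep the example short.

```python
# Hypothetical scoring filter using the point values from the rules above.
ADDRESS_BOOK = {"friend@example.com"}  # assumed stand-in for a real address book
THRESHOLD = 100


def spam_score(sender: str, subject: str, body: str, has_image: bool) -> int:
    """Add penalty points and subtract bonus points, one rule at a time."""
    score = 0
    if "over 18" in body.lower():
        score += 50
    if sender.lower() in ADDRESS_BOOK:
        score -= 90
    if "teen" in body.lower():
        score += 10
    if "penis" in subject.lower():
        score += 80
    if has_image:
        score += 70
    return score


def should_delete(score: int) -> bool:
    """A message is deleted only when its total exceeds the threshold."""
    return score > THRESHOLD


# The grommet order still matches "teen", but 10 points is far below 100.
order = spam_score("stranger@example.net", "order",
                   "please send me seventeen XJR37-Q grommets", False)
print(order, should_delete(order))  # 10 False
```

The point of the design is that no single weak signal can condemn a message; several have to pile up before the threshold is crossed.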

There are two different techniques for scoring. The first is for the anti-spam vendor to bundle a list of filters with the product. The second is for the anti-spam tool to adaptively figure out what the filters should be. (See Paul Graham's A Plan For Spam for an example of an adaptive algorithm that uses Bayesian statistics.)
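
To give a feel for the adaptive approach, here is a deliberately simplified Python sketch of a filter that learns word statistics from messages the user has labeled. It is a plain naive Bayes classifier with add-one smoothing, not the specific formula from A Plan For Spam, and the class name AdaptiveFilter, the whitespace tokenizer, and the tiny training set are all assumptions for illustration.

```python
import math
from collections import Counter


def tokenize(text: str) -> set:
    """Crude tokenizer (an assumption): lowercase, split on whitespace."""
    return set(text.lower().split())


class AdaptiveFilter:
    """Simplified naive Bayes filter that learns from labeled messages."""

    def __init__(self):
        self.spam_counts = Counter()  # messages containing each word, spam
        self.ham_counts = Counter()   # messages containing each word, non-spam
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text: str, is_spam: bool) -> None:
        tokens = tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.n_spam += 1
        else:
            self.ham_counts.update(tokens)
            self.n_ham += 1

    def spam_probability(self, text: str) -> float:
        total = self.n_spam + self.n_ham
        # Work in log space; add-one smoothing avoids zero probabilities.
        log_spam = math.log((self.n_spam + 1) / (total + 2))
        log_ham = math.log((self.n_ham + 1) / (total + 2))
        for tok in tokenize(text):
            log_spam += math.log((self.spam_counts[tok] + 1) / (self.n_spam + 2))
            log_ham += math.log((self.ham_counts[tok] + 1) / (self.n_ham + 2))
        # Clamp the difference so math.exp cannot overflow on long messages.
        diff = max(min(log_ham - log_spam, 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


f = AdaptiveFilter()
f.train("buy cheap pills now", True)
f.train("cheap pills online", True)
f.train("meeting agenda attached", False)
f.train("lunch meeting tomorrow", False)
print(f.spam_probability("cheap pills now") > 0.5)          # True
print(f.spam_probability("meeting agenda attached") > 0.5)  # False
```

Each "this is spam" or "this is not spam" decision by the user becomes another call to train, so the weights drift toward that user's own mail.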

Collaborative filtering.

With collaborative filtering, your software connects to a database that contains digital fingerprints of known spam. If you get a message whose fingerprint matches, that message gets marked as spam. If a piece of spam gets through, you can click on a "this is spam" button to report it.

Collaborative filtering only works if spammers send the exact same message to thousands of people. In general, the spammers only have to change the message very slightly in order to completely change the digital signature. Thus there is a "sweet spot" where collaborative filtering works: there have to be enough others in your collaborative network to find spam, but not so many that spammers start to individualize messages.
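
The fragility of exact fingerprints is easy to demonstrate. The Python sketch below uses SHA-256 over lightly normalized text as the fingerprint, which is an assumption; real services may use different (and fuzzier) signatures. KNOWN_SPAM is a hypothetical in-memory stand-in for the shared database.

```python
import hashlib

KNOWN_SPAM = set()  # hypothetical stand-in for the shared fingerprint database


def fingerprint(message: str) -> str:
    """Exact-match fingerprint: SHA-256 of lowercased, whitespace-normalized text."""
    normalized = " ".join(message.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def report_spam(message: str) -> None:
    """What a 'this is spam' button might do: store the fingerprint."""
    KNOWN_SPAM.add(fingerprint(message))


def is_known_spam(message: str) -> bool:
    return fingerprint(message) in KNOWN_SPAM


report_spam("Make money fast! Click here.")
print(is_known_spam("Make money fast!  Click here."))     # True: only spacing differs
print(is_known_spam("Make money fast! Click here. x7q"))  # False: a few added characters
```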

We are in such a "sweet spot" right now (late 2002), but I don't know how long it will last. Collaborative filtering may work for another two or three years.

Challenge/response.

If a message arrives from someone who is not on your whitelist, the software automatically sends a reply with a request that is easy for a person but hard for a computer. For example, the reply might ask the person to send another message with the word "aardvark" in the subject line. (If it's a spammer, the return address is probably fake, so it probably won't even get to them.)

Challenge-response systems transfer effort from you to your correspondents, so you are dependent upon the goodwill of your correspondents. If enough people start using challenge-response systems, I know that my own goodwill is going to go way down. For example, if I see an error on a Web page, I might not bother reporting it if I think I'm going to have to fight my way through a challenge.

I am also concerned that spammers might start forging return addresses even more, hoping that the address that they forge will be one that is on your "approved" list.

What's worse is that a challenge may get caught by a spam filter. My spam filters caught a challenge from ValiMail (because it had an embedded image); I only noticed it because I look over my spam messages carefully.

Finally, a challenge-response system MUST allow messages from senders that you send messages to. Otherwise, you could issue challenges in response to other people's challenges! This means that the challenge-response must be integrated tightly with an email program.
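
A rough Python sketch of that bookkeeping might look like the following. ChallengeResponder and its method names are hypothetical, and actually sending the challenge email is left out; the sketch only tracks who is trusted and which messages are being held.

```python
class ChallengeResponder:
    """Hypothetical challenge-response bookkeeping (no real mail is sent)."""

    def __init__(self):
        self.whitelist = set()  # addresses allowed straight through
        self.pending = {}       # sender -> messages held until the sender confirms

    def record_outgoing(self, recipient: str) -> None:
        # Anyone you write to is trusted, so their replies (and their own
        # challenges) are never answered with a challenge from you.
        self.whitelist.add(recipient.lower())

    def handle_incoming(self, sender: str, message: str) -> str:
        sender = sender.lower()
        if sender in self.whitelist:
            return "deliver"
        first_time = sender not in self.pending
        self.pending.setdefault(sender, []).append(message)
        # Challenge a stranger once; hold later messages without nagging.
        return "challenge" if first_time else "hold"

    def confirm(self, sender: str) -> list:
        """The sender answered the challenge: trust them and release held mail."""
        sender = sender.lower()
        self.whitelist.add(sender)
        return self.pending.pop(sender, [])


box = ChallengeResponder()
box.record_outgoing("webmaster@example.org")
print(box.handle_incoming("webmaster@example.org", "Please confirm you are human"))  # deliver
print(box.handle_incoming("stranger@example.net", "Hello"))                          # challenge
```

The record_outgoing step is the piece that needs the tight integration with the email program: the filter has to see every message you send.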

I think that challenge-response is only reasonable in conjunction with a spam scoring system:

If a message has a very high spam score, delete it.
If a message has a very low spam score, let it through.
If a message has a borderline spam score, issue a challenge.
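
Those three rules can be sketched as a small Python function. The cutoff values of 20 and 100 are assumptions picked only for illustration; a real tool would need to tune them.

```python
def triage(score: int, low: int = 20, high: int = 100) -> str:
    """Decide what to do with a message from its spam score.

    The thresholds are hypothetical: above `high` is treated as obvious
    spam, below `low` as obviously fine, and anything between is borderline.
    """
    if score > high:
        return "delete"
    if score < low:
        return "deliver"
    return "challenge"


print(triage(130))  # delete
print(triage(-80))  # deliver
print(triage(60))   # challenge
```

Only the borderline band ever bothers a correspondent, which keeps the number of challenges small.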

Despite the limitations of collaborative filtering and challenge-response, all three techniques kill more spam with much less collateral damage than the first-generation strategies.

I also know of some anti-spam solutions that are in the works that may do an even better job. I expect that anti-spam solutions will be far better next year -- and I'm really looking forward to that!

Recommendations

Note -- on 24 Feb 2003, I took a job at the Open Source Applications Foundation, and will be devoting essentially all of my attention to them. This means that I'm not going to have much (if any!) time to evaluate more anti-spam products.

I believe that in the long run, scoring filters are the way to go, particularly if they are at least somewhat adaptive. In the short term, as I mentioned before, the collaborative filtering systems might work well for you. If your account doesn't get many messages from strangers -- like on your personal account -- then challenge-response might work well for you.

I have not tested all of these. In some cases, I'm going on recommendations. I have also looked at the information on the programs' Web sites to guess which ones might be good. As I test more, I will update this list.

All

SpamBayes is one of the most mature anti-spam offerings out there, and it has been widely ported.

Outlook

SpamAssassin Pro

For Outlook users, my current favorite is SpamAssassin Pro from Deersoft. (Note: SpamAssassin Pro is one of the few that I have actually tested myself.) SpamAssassin Pro properly flagged 810 of 839 spam messages. Of 315 non-spam messages, it did identify 5 as spam, but those five weren't clearly spam, even to me.

Those five "false-positives" were semi-commercial messages from companies I used to have a minor relationship with. Back when I set up the test messages set, those five related to my business activities. By the time I actually tested SpamAssassin Pro, those five were no longer interesting. So in some sense, SpamAssassin Pro got it right and I got it wrong.

(Also note that SpamAssassin Pro makes whitelisting pretty easy; if I had whitelisted those senders, I would have had unambiguously zero false-positives.)
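
For anyone who wants those counts as rates, the arithmetic is short. This Python sketch uses only the figures quoted above; the function name rate is made up for illustration.

```python
def rate(part: int, whole: int) -> float:
    """Fraction of `whole` represented by `part`."""
    return part / whole


# Counts from the SpamAssassin Pro test described above.
catch_rate = rate(810, 839)        # spam correctly flagged
false_positive_rate = rate(5, 315)  # non-spam flagged as spam

print(f"caught {catch_rate:.1%} of spam")               # caught 96.5% of spam
print(f"flagged {false_positive_rate:.1%} of non-spam")  # flagged 1.6% of non-spam
```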

Matador

Matador also does a very good job. Of 492 messages, it found 439 spam messages and left alone 38 non-spam messages. It did miss 17 spam messages, but 13 of those were from the same sender. It also tagged as spam one message which was a hoax forwarded from a friend who wasn't in my address book -- which seems like a reasonable thing for Matador to do.

Outlook Express

For Outlook Express, try SpamInspector or iHateSpam. Both are integrated tightly with Outlook Express; I haven't tried either of them but the SpamInspector site inspires more confidence in me.

You can also try the generic POP3 programs listed below.

Any POP3

I've also heard raves about MailWasher, in part because it is free! It will work with any POP email program (e.g. Eudora, Outlook Express, but not AOL), but not being integrated means a little bit more work on your part.

AOL

AOL users, you can try SpamInspector or MailShell, but I don't know how well they integrate.

Mac OS

My current favorite for OS X is Spamfire. It has almost all the features that I want, a nice interface, and discriminates well between spam and non-spam.

It works okay with OS 9 -- except for occasional Spamfire hangs. They are working on fixing that problem but it isn't fixed yet.

Also, if you use Spamfire with Eudora, you might need to add one filter of your own. Don't worry, it's easy.

I hear that the built-in junk mail filter for the built-in Mac OS X mailer ("Mail.app") uses Bayesian filtering and is pretty good. I've also heard good things about SpamSieve for Mailsmith, Entourage, and PowerMail users.

Server-Side

On the server side, there are some open-source tools with extremely good recommendations:

SpamAssassin -- scoring filters
bogofilter -- Bayesian
iFile -- Bayesian

A discussion of the Bayesian algorithm is at A Plan For Spam.

Other Programs: POP Scrubbing Services

If you use an obscure email program, you may need to use a POP scrubbing service. You interact with a Web site -- so it doesn't matter which email program or operating system you use. Every once in a while, their service will access your mailbox, delete any messages that it thinks are spam, and leave the rest.
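
In outline, a scrubbing pass could look like the Python sketch below, which uses the standard library's poplib module. This is a guess at the general shape, not how any particular service works. The function names, the host, and the looks_like_spam callback are all hypothetical, and a real service would need error handling and a much better classifier.

```python
import poplib


def open_mailbox(host: str, user: str, password: str) -> poplib.POP3_SSL:
    """Log in to a POP3 mailbox over SSL (host and credentials are placeholders)."""
    conn = poplib.POP3_SSL(host)
    conn.user(user)
    conn.pass_(password)
    return conn


def scrub_mailbox(conn, looks_like_spam) -> int:
    """Delete every message the classifier flags; return how many were deleted.

    `conn` is any object with poplib.POP3's list/retr/dele/quit methods;
    `looks_like_spam` is a hypothetical callback taking the message text.
    """
    deleted = 0
    _, listing, _ = conn.list()
    for entry in listing:
        msg_num = int(entry.split()[0])   # each entry is b"<number> <size>"
        _, lines, _ = conn.retr(msg_num)
        text = b"\n".join(lines).decode("utf-8", errors="replace")
        if looks_like_spam(text):
            conn.dele(msg_num)            # marked now, removed on quit
            deleted += 1
    conn.quit()                           # commits the deletions
    return deleted


# Example use (not run here, since it needs a real account):
# conn = open_mailbox("pop.example.com", "me", "secret")
# scrub_mailbox(conn, lambda text: "over 18" in text.lower())
```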

(This only works if your email account has a POP interface. AOL accounts do not. Yahoo does, but only for a fee. Hotmail does not.)

There are also services that give you an email account. All non-spam messages to that address are forwarded to your "real" email account. Some, like SpamCop, can either send messages through their account or scrub your POP account. Others, like POBox.com, only scrub the account you get from them.

Steve Bass/PCWorld review of IHateSpam, MailWasher, Spaminex, and MailWasher

Updated most recently on 16 October 2002.