How to Test Enterprise Spam Defenses?

How many licks does it take to get to the Tootsie Roll center of a Tootsie Pop? As with the number of false-positive results generated by enterprise anti-spam products, writes Security Center Editor Larry Seltzer, the world may never know.

There are many measures of a spam-blocking product, but probably the most important one is the number of false positives it generates. False positives are non-spam e-mails that the product mistakenly classifies as spam. They represent the most important failure of a product because the more of them there are, the less you can trust the spam blocking.

But testing products for their false positives is difficult. As I said in a recent column, you don't really know if you have false positives (or how many there are) unless you go through the blocked mail and count the ones that shouldn't have been blocked. Some of you disagree with me on this, but I'm unmoved on the subject.

It's even harder when you're trying to do an independent lab test, which happens to be my line of work. I have been designing and implementing software tests for about 16 years now, and testing enterprise spam blockers is clearly the most difficult problem I have ever encountered. I've developed many tests that simply took a lot of work, but this one's over the top.

Testing consumer anti-spam products isn't as bad. My usual test plan, which I've used in the past for PC Magazine, is to turn one of my legit accounts into a forwarding account that forwards to a series of test accounts. I'll usually collect about 200 messages for training purposes and then about 1,000 for the actual test.

After I train and then run the 1,000 messages through the filter, I examine the blocked mail for false positives and the non-blocked mail for spam. Because this is my mail and the numbers aren't too huge, I can handle the process manually. Assuming I'm consistent from product to product, this method is accurate.
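The bookkeeping behind that manual pass is simple arithmetic. Here is a minimal sketch in Python, assuming the analyst records four counts by hand; the function name and the example counts are made up for illustration and are not results from any real test.

```python
def error_rates(legit_blocked, legit_delivered, spam_blocked, spam_delivered):
    """Return (false_positive_rate, false_negative_rate) from hand counts."""
    legit_total = legit_blocked + legit_delivered
    spam_total = spam_blocked + spam_delivered
    # False positive rate: share of legitimate mail that was wrongly blocked.
    fp_rate = legit_blocked / legit_total if legit_total else 0.0
    # False negative rate: share of spam that slipped through to the inbox.
    fn_rate = spam_delivered / spam_total if spam_total else 0.0
    return fp_rate, fn_rate


# Hypothetical counts for a 1,000-message run (invented for illustration).
fp, fn = error_rates(legit_blocked=4, legit_delivered=396,
                     spam_blocked=540, spam_delivered=60)
print(f"false positives: {fp:.1%}, false negatives: {fn:.1%}")
```

Dividing by the legitimate and spam totals separately keeps the two rates comparable from product to product even when the mix of good mail and spam differs between runs.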

But how should one test an enterprise product? Is 1,000 messages, all to the same e-mail address, enough? As a test designer, what I would like to do is run half a million messages through each product. I want the message base to include large numbers of threads involving group addresses and people inside and outside the organization. A number of problems make this impractical for a test situation.

First, that's a lot of messages for a lab analyst to examine, and someone's going to have to examine them manually. It's just too much work. Manual analysis doesn't scale.

The second big problem is where to get a large enough collection of real mail to mix with the spam. This is yet another example of a problem that looks easy to fix but is actually quite hard. We need thousands of messages typical of a corporate mail database. Perhaps you'd like to volunteer your own corporation's e-mail for our testers to examine as part of our benchmarks? I didn't think so.

As it turns out, even people at PC Magazine aren't comfortable with using their own company e-mail for this purpose, and I don't blame them; it should be confidential. And remember that the interesting messages in this test are the ones coming from outside the test organization, so if you use real mail, that means you're using messages from third parties, probably without their consent. How fair is that?

My best theoretical solution to this problem is to take messages from moderated newsgroups and change the users into fake users from a fictional corporate directory and a fictional directory of outsiders. I would tag these good messages with a custom header so that they can be recognized in the post-filtering message database.
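One way that rewriting and tagging step might look is sketched below in Python, using only the standard library's email package. Everything named here is an assumption for illustration: the header name X-Test-Known-Good, the example.com addresses, and the helper anonymize_and_tag are hypothetical, and the sketch rewrites only the From header. A real tool would also have to handle To, Cc, reply chains and names quoted in message bodies.

```python
from email import message_from_string
from email.utils import formataddr, parseaddr

# Hypothetical header name; not taken from any real product or standard.
GOOD_HEADER = "X-Test-Known-Good"


def anonymize_and_tag(raw_message, directory, message_id):
    """Swap the real sender for a fictional one and tag the message as known good."""
    msg = message_from_string(raw_message)
    _, real_addr = parseaddr(msg.get("From", ""))
    # Reuse the same fictional user for a given real address so threads stay coherent.
    n = len(directory) + 1
    fake_name, fake_addr = directory.setdefault(
        real_addr.lower(), (f"User {n}", f"user{n}@example.com"))
    del msg["From"]  # deleting a missing header is a no-op
    msg["From"] = formataddr((fake_name, fake_addr))
    del msg[GOOD_HEADER]
    msg[GOOD_HEADER] = message_id  # marks the message as known-good test mail
    return msg.as_string()


# Illustrative message; the address and text are invented.
directory = {}
raw = ("From: Jane Doe <jane@realcorp.test>\n"
       "Subject: Re: build question\n"
       "\n"
       "See the FAQ.\n")
tagged = anonymize_and_tag(raw, directory, "good-0001")
print(tagged)
```

Keeping the mapping in a dictionary means a given newsgroup poster always becomes the same fictional employee or outsider, which preserves the thread structure the test is meant to exercise.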

I have some other theories for how to construct a semi-synthetic benchmark that would mix this database of known good mail with another database of current spam, but they involve some moderately complicated programming. If it all works, at the end I should have spam-filtered mail on which I can programmatically count false positives and negatives. This should scale.
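The counting step at the end could be as small as the following Python sketch. It rests on two assumptions that would need checking against each product: that the custom header (here the hypothetical name X-Test-Known-Good) survives filtering intact, and that every message without the header came from the spam database.

```python
from email import message_from_string

# Hypothetical tag on known-good mail, assumed to survive the filter under test.
GOOD_HEADER = "X-Test-Known-Good"


def score(delivered, blocked):
    """Count (false_positives, false_negatives) from raw messages in two piles."""
    def is_good(raw):
        return message_from_string(raw)[GOOD_HEADER] is not None

    # A tagged (known good) message in the blocked pile is a false positive.
    false_positives = sum(1 for raw in blocked if is_good(raw))
    # An untagged message (spam) in the delivered pile is a false negative.
    false_negatives = sum(1 for raw in delivered if not is_good(raw))
    return false_positives, false_negatives


# Tiny invented sample: one good message and one spam message, reused in both piles.
good = f"{GOOD_HEADER}: good-0001\nSubject: Re: build question\n\nSee the FAQ.\n"
spam = "Subject: Cheap meds\n\nBuy now.\n"
fp, fn = score(delivered=[good, spam], blocked=[spam, good, spam])
print(fp, fn)
```

Because the classification comes from a header rather than from a person reading each message, the same scoring pass works on 1,000 messages or half a million.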

I've been talking to other people with a lot of experience in test development, in the anti-spam business and elsewhere. I haven't yet found a good implementation of a test that would provide reliable and repeatable results across multiple vendors. I think we'll get there because it would be an invaluable tool for corporate IT buyers. In the meantime, however, you have to have the right perspective on any benchmarks you see; we don't know enough about how well these products work.

Security Center Editor Larry Seltzer has worked in and written about the computer industry since 1983.
