SpamProbe - Bayesian Spam Filtering Tweaks
by Brian Burton
In the summer of 2002 my inbox was flooded with spam. After years of resisting the idea of installing a spam filter, I had finally reached the point where one was truly necessary. I experimented with a widely used "catalog" type filter but was disappointed with its performance. The idea of a heuristic-based filter, with its constant update requirements, was unappealing to me. Finally I resigned myself to writing my own filter. I planned to write a neural-network-driven filter as soon as I had a chance.
Fortunately I never needed to go that route. Around this time a friend sent me a link to Paul Graham's excellent article "A Plan for Spam." His algorithm was so elegant and his numbers so enticing that I had to try it. As a quick experiment I wrote a Ruby prototype using a GDBM database. Paul's algorithm worked as advertised, and my inbox became free of spam for the first time in years. I immediately decided to write an open source implementation to share the bounty with the rest of the net. The Ruby prototype worked well but was too slow for production use.
Next I wrote a Java implementation that communicated directly with my IMAP server to classify emails and move them into different folders. The Java implementation used PostgreSQL as its database. At first PostgreSQL was terribly slow, but after running vacuumdb a few times the database performance improved dramatically. Unfortunately, Sun's IMAP client library proved unworkable, and I dropped the Java implementation after a few days.
Finally I decided to write my filter a third time. Being an OO junkie, I chose C++ as my implementation language. C++ has many warts but also many advantages; chief among these are portability, performance, and a rich standard library. I eventually chose Berkeley DB as the database format for its portability and performance. Within a few days I had my C++ implementation working well and posted it to SourceForge for others to share.
SpamProbe owes its existence and basic operating principles to Paul's ideas. It is not a direct implementation of Paul's algorithm, however. There are some tweaks and changes applied in an effort to improve the algorithm's effectiveness. This paper describes the key differences.
Tokenizer
The SpamProbe tokenizer attempts to extract terms from emails in an internet-savvy way. The tokenizer understands HTML entity references as well as URL encoding. Base64 and Quoted-Printable encodings are also understood and handled properly. The tokenizer allows a small set of characters to appear within terms ('.', ',', '+', '-', '_', '$') and treats all others as white space. Pure numbers are ignored, but numbers linked together by punctuation are retained (e.g. 127 would be ignored but 127.0.0.1 would be retained). Leading dollar signs are also retained, since 10,000 may not be spammy but $10,000 probably is. All terms are converted to lower case to minimize database size and avoid false positives based on case.
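The character rules above can be sketched roughly as follows. This is a simplified Python illustration, not SpamProbe's actual C++ tokenizer; the regular expression and function name are my own:

```python
import re

# Characters allowed inside a term; everything else acts as white space.
TERM_RE = re.compile(r"\$?[a-z0-9]+(?:[.,+\-_$][a-z0-9]+)*", re.IGNORECASE)

def tokenize(text):
    """Extract terms using rules similar to those described above."""
    terms = []
    for match in TERM_RE.finditer(text):
        term = match.group(0).lower()
        # Pure numbers are ignored; numbers joined by punctuation survive,
        # as does a leading dollar sign.
        if re.fullmatch(r"[0-9]+", term):
            continue
        terms.append(term)
    return terms
```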
SpamProbe breaks tokens containing punctuation into additional sub-terms by repeatedly separating off the leading portion of the term. This technique is intended to extract the domain name from fully qualified host names in URLs. For example, the single term mail.burton-computer.com would be broken down by SpamProbe into these terms: mail.burton-computer.com, mail, burton-computer.com, burton, computer.com, computer, and com.
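This repeated separation of the leading portion can be sketched as follows, a Python illustration under the assumption that '.' and '-' are the separators, matching the example above:

```python
import re

def subterms(term):
    """Break a punctuated term into sub-terms by repeatedly separating
    the leading word from the remainder."""
    results = []
    while term:
        results.append(term)
        match = re.search(r"[.\-]", term)
        if not match:
            break
        head = term[:match.start()]
        if head:
            results.append(head)          # the leading word becomes a term
        term = term[match.start() + 1:]   # repeat on the remainder
    return results
```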
SpamProbe uses Paul's technique of classifying some terms based on where they were found. Specifically, terms from the Subject, To, and Cc headers are specially flagged.
Phrases
In his paper Paul mentioned that accuracy might be improved by scoring word phrases in addition to single words. SpamProbe normally uses word pairs (two-word "phrases") as terms; a command line option allows an arbitrary number of words to be used. Testing has demonstrated that using word pairs yields a noticeable improvement in accuracy. Longer phrases generally provide a further increase in accuracy, but with diminishing returns. Since longer phrases lead to a bigger database, the tradeoff of space versus accuracy led me to choose pairs as the default. Users with ample disk space can choose longer phrases if desired.
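A sliding window over the token stream is one simple way to build such phrases. This is a sketch of the idea; SpamProbe's exact phrase construction may differ in details:

```python
def phrases(words, max_words=2):
    """Emit every run of 1..max_words consecutive words as a term."""
    out = []
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            out.append(" ".join(words[i:i + n]))
    return out
```

With the default of two-word phrases, each message contributes both its single words and its adjacent pairs to the database.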
Within Document Frequency (Repeats)
Within Document Frequency (WDF) means the number of times a term appears in a document. Many search engines rank documents with high WDF earlier in their results than documents with only a single occurrence of the term. I decided to apply this same idea to spam detection.
SpamProbe allows terms to appear multiple times in the top terms array. In contrast, Paul's filter only allows each term to appear once in the array. To prevent a single term from dominating the top terms array, SpamProbe uses a much larger array. The two parameters, array size (S) and number of repeats (R), are crucial to tuning SpamProbe's performance. I use the notation S/R to indicate an array of S terms with a maximum of R occurrences of any single term. By default SpamProbe uses 27/2.
After extensive testing with my own email I found that 27/2 provided better performance than 15/1. In particular, 27/2 yields fewer false positives in my email. Generally speaking, higher values of R yield better accuracy and fewer false positives, as long as S is increased at the same time. No doubt this result is highly dependent on the characteristics of a person's email. 27/2 provides a nice middle ground, but people are encouraged to experiment on their own email to find the best setting for themselves.
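The S/R selection and the scoring it feeds can be sketched like this. The selection logic is my simplified reading of the description above, and the final score uses Graham's combining formula from "A Plan for Spam":

```python
def top_terms(term_stats, size=27, max_repeats=2):
    """Select up to `size` term occurrences, with each distinct term
    capped at `max_repeats` slots.  `term_stats` maps
    term -> (spam_probability, count_in_email)."""
    candidates = []
    for term, (prob, count) in term_stats.items():
        for _ in range(min(count, max_repeats)):
            candidates.append(prob)
    # Most significant terms (farthest from the 0.5 mean) first;
    # good (low probability) terms get first dibs on ties.
    candidates.sort(key=lambda p: (-abs(p - 0.5), p))
    return candidates[:size]

def combined_score(probs):
    """Graham's formula: p1*...*pn / (p1*...*pn + (1-p1)*...*(1-pn))."""
    num = den = 1.0
    for p in probs:
        num *= p
        den *= 1.0 - p
    return num / (num + den)
```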
Ignoring HTML
This proved to be one of the more controversial design decisions in SpamProbe. My primary concern was to minimize false positives. In my early prototype I scanned HTML tags as ordinary terms, as originally suggested by Paul. I was disturbed by the fact that most of the spammiest terms in my database were HTML tags and attributes. This led me to fear that people who send legitimate HTML mail were likely to find their mails dropped by the spam filter.
As a result I decided to strip out HTML tags and ignore them. The only exception is that URLs contained in them are retained. URLs are fair game for detection, since individuals sending genuine emails are unlikely to include the URLs of spammer web sites in them.
Testing on my own email indicated that ignoring HTML tags resulted in a slight loss of accuracy but a lower false positive rate. Also the top terms array is much more satisfying when it contains actual words and URLs rather than hex color codes, tags, etc.
Folding 8 Bit Characters
Many people in the United States who send and receive email only in English can take advantage of 8 bit character folding. Spam in your native language is bad enough, but spam in an unknown language is downright infuriating. For example, I receive a substantial amount of Korean spam. None of this spam is desirable, although it does have the benefit of being unreadable.
By default (configurable on the command line or at build time) SpamProbe converts all characters with the 8th (most significant) bit set into the letter 'z'. For example, any six character word consisting entirely of 8 bit characters (no matter what those characters are) will be stored in the database as the term "zzzzzz." This could be viewed as a trivial form of lossy compression. The loss of detail is unimportant to people who never receive ham containing large numbers of 8 bit characters, and it can lead to tremendous savings in disk space as well as extremely efficient classification of Korean spams.
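A minimal sketch of the folding (in Python for illustration; SpamProbe operates on raw bytes in C++, but the effect is the same):

```python
def fold_high_bit(text):
    """Fold every character whose 8th (most significant) bit is set
    into the letter 'z'."""
    return "".join("z" if ord(ch) > 0x7F else ch for ch in text)
```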
Variable Sized Terms Array
SpamProbe has an extended array option to allow the terms array to grow as much as necessary to accommodate all of the terms in the email which have "distance from mean" greater than or equal to 0.4 (i.e. all terms with spam probability less than or equal to 0.1 or greater than or equal to 0.9). Using this technique provides a safety cushion for emails which contain an abundance of significant terms. In some cases a spam with many "good" terms and even more spammy terms could be incorrectly scored as good because the good terms crowd out the spammy ones (SpamProbe gives good terms first dibs on array slots in the case of ties). Extending the array allows SpamProbe to use all available information when computing the score.
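The extended array selection might look like this. This sketch reflects my reading of the rule above: keep the usual number of terms, then keep growing the array while the next term still has distance of at least 0.4 from the mean:

```python
def extended_select(probs, size=27):
    """Rank term probabilities by distance from 0.5 and keep at least
    `size` of them, extending the array to include every term with
    distance >= 0.4 (i.e. p <= 0.1 or p >= 0.9)."""
    ranked = sorted(probs, key=lambda p: -abs(p - 0.5))
    cutoff = size
    while cutoff < len(ranked) and abs(ranked[cutoff] - 0.5) >= 0.4:
        cutoff += 1
    return ranked[:cutoff]
```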
SpamProbe provides a mode (the -X command line option) which combines WDF and the variable sized terms array. This mode uses array size 5 and allows each term to repeat up to 5 times. To prevent the obvious flaw in this arrangement (a single repeated term filling the entire array), -X mode also uses the extended array mode. The effect of these settings is generally to use only very significant terms and to weight heavily those terms which appear frequently in the email. Experiments with my own email indicate that this mode results in slightly lower accuracy but reduces false positives and also improves "recall".
Improving Recall
Spam tends to be highly repetitive. Spammers often blast out the same junk mails several times over a period of days. Also, if a person has multiple email addresses they will often receive the same spam several times, addressed to different accounts. Some spams are harder to classify than others. For example, Paul's "spam of the future" examples contain very few words on which to base a decision. Because of the lack of adequate information in these spams, retransmissions of them can be classified incorrectly by the filter even after having been manually classified by a user. Proper classification of an email after it has been manually classified is called "recall."
SpamProbe attempts to improve recall by classifying each email in a loop. The loop repeatedly adjusts the database counts for terms in the email and then scores the email. If the score shows the email to be misclassified, the process is repeated. This loop can run up to 10 times. Experiments with my corpus indicate that this process can make even the most difficult "spams of the future" recognizable as spam.
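The loop can be sketched as follows. Here `train` and `score` are hypothetical callables standing in for SpamProbe's database update and scoring steps, and the 0.9 spam threshold is an assumption for illustration only:

```python
def train_to_recall(train, score, is_spam, threshold=0.9, max_rounds=10):
    """Repeatedly reinforce the term counts for a manually classified
    message, rescoring each time, until the filter agrees with the
    manual classification or the round limit is reached."""
    for rounds in range(1, max_rounds + 1):
        train()       # adjust database counts for the message's terms
        s = score()   # rescore with the updated counts
        if (s >= threshold) == is_spam:
            break     # now classified correctly
    return rounds, s
```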
SpamProbe's -X mode (described above) dramatically improves recall at the expense of some accuracy. All experiments on my corpus eliminate duplicate messages, so the impact of improved recall on real world accuracy is uncertain. I suspect that higher recall at the expense of a small accuracy hit will produce better overall accuracy in the real world, since repeats or variations on hard to classify spams should offset the small loss in accuracy on new, unique messages. Fortunately, spammers are not all that creative.
Image Analysis
Spammers have made pathetic attempts to bypass statistical filters over the last few years. Most of these techniques are just plain useless (random word selection, deliberate misspellings, use of digits in place of letters, etc.), but one can be troublesome: spams containing few or no words and one or more images. This tactic attempts to get around filters by denying them any words to use when scoring the spam.
As of version 1.4, SpamProbe uses a simple image analysis technique to catch these spams. Whenever SpamProbe sees an image attachment it processes the image to extract a set of tokens from properties of the image itself as well as any metadata contained in the image file. Since most people receive few images in their real email and many in their spam, these image tokens tend to become indicators of spam, to a degree that depends on how often the user receives images in their good email. For example, in my database most of these terms have spam probabilities above 0.9, and some have probabilities above 0.99. Still others have distinctly low probabilities, since they are common in good mail but not in spam.
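The idea can be illustrated as follows. The specific properties and token spellings here are my own invention for the sketch, since the paper does not enumerate exactly which image properties SpamProbe tokenizes:

```python
def image_tokens(width, height, fmt, metadata):
    """Turn basic image properties and embedded metadata into terms
    that can be counted like any other token."""
    tokens = [
        "image_format_" + fmt.lower(),
        "image_size_%dx%d" % (width, height),
    ]
    for key, value in metadata.items():
        # Metadata strings (e.g. the software that produced the image)
        # often differ sharply between spam and legitimate photos.
        tokens.append("image_meta_%s_%s" % (key.lower(), str(value).lower()))
    return tokens
```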
Preliminary results indicate that this feature provides sufficient extra tokens to allow SpamProbe to capture most of the content-free image spams it encounters without increasing false positives.
Conclusions
Based on my experience with SpamProbe, there is no perfect set of optimizations or tweaks to achieve spam filtering perfection. Paul's basic algorithm already solves over 99 percent of the problem, and additional tweaks ultimately run into the limits of statistical noise. Also, the accuracy of any Bayesian filter appears to be highly dependent on the size of the corpus: the more spam a person receives, the more accurate the filter becomes. My overall accuracy (99.7% as of January 2003) has improved every month since I first started using SpamProbe, even when I have made no tuning changes on my production server.
Finally, a word of advice: an individual user evaluating different spam filters should not give too much weight to the accuracy figures quoted by their authors. Different authors compute their numbers in slightly different ways on their own (unique) email corpora, so only a side-by-side comparison on a single corpus is useful as a means of comparison. A user looking to install a filter should pick one that they are comfortable installing and be happy with the results. Any of the filters based on Paul's technique should be extremely accurate for most people.