Introduction

Welcome to SpamProbe! Are you tired of the constant bombardment of your inbox by unwanted email pushing everything from porn to get-rich-quick schemes? Have you tried other spam filters but become disenchanted with them when you realized that their manually generated rule sets weren't updated fast enough to keep up with spammers' wording changes? Or that they generated unwanted false positives?

SpamProbe operates on a different basis entirely. Instead of using pattern matching and a set of human-generated rules, SpamProbe relies on a Bayesian analysis of the frequency of words used in the spam and non-spam emails received by an individual person. The process is completely automatic and tailors itself to the kinds of emails that each person receives.

SpamProbe was inspired by an excellent article by Paul Graham. He describes the basic idea and his results. You can read his article here:

http://www.paulgraham.com/spam.html

I highly recommend reading the article and the other spam related links on his site for profound insights into why spam is a problem and how you can defeat it. We all owe Paul a debt of gratitude for showing us the way to spam-free living.
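
To make the Bayesian idea concrete, here is a minimal Python sketch in the spirit of the formula described in Paul Graham's article. It is an illustration only, not SpamProbe's actual code (SpamProbe is written in C++ and its scoring differs in detail). The class name TinyBayes, the whitespace tokenizer, and the choice to skip Graham's minimum occurrence threshold are all assumptions made to keep the example short.

```python
from collections import Counter


def tokenize(text):
    """Split text into lowercase words (a deliberately crude tokenizer)."""
    return text.lower().split()


class TinyBayes:
    """Toy per-user word-frequency filter; hypothetical, for illustration."""

    def __init__(self):
        self.spam_counts = Counter()  # word -> occurrences in spam
        self.good_counts = Counter()  # word -> occurrences in good mail
        self.n_spam = 0               # number of spam messages seen
        self.n_good = 0               # number of good messages seen

    def train(self, text, is_spam):
        """Add one classified message to the word counts."""
        if is_spam:
            self.spam_counts.update(tokenize(text))
            self.n_spam += 1
        else:
            self.good_counts.update(tokenize(text))
            self.n_good += 1

    def term_probability(self, term):
        """Estimate the probability that a message containing term is spam."""
        # Good counts are doubled, as in Graham's article, to bias the
        # filter away from false positives.
        g = 2 * self.good_counts[term]
        b = self.spam_counts[term]
        if g + b == 0 or not self.n_spam or not self.n_good:
            return 0.4  # neutral-ish value for words never seen before
        spam_ratio = min(1.0, b / self.n_spam)
        good_ratio = min(1.0, g / self.n_good)
        p = spam_ratio / (spam_ratio + good_ratio)
        return max(0.01, min(0.99, p))  # never fully certain either way

    def score(self, text, max_terms=15):
        """Combine the most decisive word probabilities into one score."""
        probs = [self.term_probability(t) for t in set(tokenize(text))]
        # The most "interesting" words are those farthest from 0.5.
        probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
        spam_product = 1.0
        good_product = 1.0
        for p in probs[:max_terms]:
            spam_product *= p
            good_product *= 1.0 - p
        return spam_product / (spam_product + good_product)
```

After calling train() on a handful of spam and good messages, score() returns a value near 1 for text dominated by words seen only in spam and near 0 for text dominated by words seen only in good mail. Because the counts come from one person's own mail, the same word can be spammy for one user and harmless for another.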


Features

  • Spam detection using Bayesian analysis of terms contained in each email. Words used often in spams but not in good email tend to indicate that a message is spam. Generally over 90% effective at detecting spam once a few hundred spams have been classified. My personal database is over 99% effective.
  • Automatically learns from incoming mails as they are classified. Incorporates user's feedback to tailor classification to each user's personal tastes.
  • Works with procmail, maildrop, or a similar tool to produce a complete server or client side spam filtering system.
  • Written in C++ for good performance. Database access using Peter Graf's PBL ISAM library or Berkeley DB for quick startup and fast term count retrieval. Also supports a fast, fixed size hash file format for maximum speed or when a fixed size database is essential.
  • Recognition and decoding of MIME attachments in quoted-printable and base64 encoding. Automatically skips non-text attachments. MIME decoding enables SpamProbe to make decisions based on words in the emails rather than base64 gobbledygook.
  • Analyzes image attachments to derive useful information from them. This feature allows SpamProbe to detect spams that contain an image and little to no text content.
  • Counts two word phrases as well as single words for higher precision. Can easily be configured to use longer phrases if desired.
  • Ignores HTML tags in emails for scoring purposes unless the -h command line option is used. Many spams use HTML and few humans do, so HTML tends to become a powerful recognizer of spams. However, in the author's opinion, scoring HTML tags also substantially increases the likelihood of false positives if someone does send a non-spam email containing them. SpamProbe does, however, pull URLs from inside HTML tags, since those tend to be spammer specific.
  • Locks mboxes and databases using fcntl file locking to avoid problems when multiple emails arrive simultaneously.
  • Scores only the Received, Subject, To, From, and Cc headers. All other headers are ignored to make it hard for spammers to hide non-spammy words in X- headers to fool the filter. The -H command line option can be used to override this.
  • Natively supports mbox, MBX, and Maildir mail box formats.
  • Supports the Content-Length: field in mbox headers. This can be disabled with the -Y option so that only From_ lines are used to recognize new messages.
  • Uses MD5 hash of emails to recognize reclassification of an already classified spam to avoid distortion of the word counts if emails are reclassified. This way emails can be kept in a mailbox that is repeatedly scanned by spamprobe without counting them more than once.
  • Provides a date stamp based database cleanup command to remove terms from the database if their counts never rise above a certain threshold value (normally 2).
  • Provides an edit-term command allowing users to directly modify the counts of individual terms, for example to force a particular term to be considered spammy or good.
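
Two of the features above, counting two-word phrases and using an MD5 hash to avoid recounting a message, can be sketched together in a few lines of Python. This is a rough illustration under stated assumptions, not SpamProbe's implementation: the names extract_terms and TermDatabase are hypothetical, the word pattern is arbitrary, and hashing the full message text is an assumption (this page does not say exactly what SpamProbe feeds into the digest).

```python
import hashlib
import re
from collections import Counter

# Arbitrary word pattern chosen for this sketch.
WORD_RE = re.compile(r"[a-z0-9$'-]+")


def extract_terms(text, max_phrase_words=2):
    """Return single words plus phrases of up to max_phrase_words adjacent words."""
    words = WORD_RE.findall(text.lower())
    terms = []
    for size in range(1, max_phrase_words + 1):
        for start in range(len(words) - size + 1):
            terms.append(" ".join(words[start:start + size]))
    return terms


class TermDatabase:
    """In-memory stand-in for a term-count database (hypothetical)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "good": Counter()}
        self.seen = {}  # message digest -> label the message was counted under

    @staticmethod
    def digest(message):
        """MD5 digest of the message text, used only as an identity key."""
        return hashlib.md5(message.encode("utf-8")).hexdigest()

    def learn(self, message, label):
        """Count a message's terms once; return False if nothing changed."""
        key = self.digest(message)
        previous = self.seen.get(key)
        if previous == label:
            return False  # already counted under this label, skip it
        terms = extract_terms(message)
        if previous is not None:
            # Reclassification: undo the earlier counts before recounting.
            self.counts[previous].subtract(terms)
        self.counts[label].update(terms)
        self.seen[key] = label
        return True
```

With this structure a mailbox can be scanned repeatedly: learn() is a no-op for messages that were already counted under the same label, and moving a message from spam to good shifts its counts instead of adding them a second time. Raising max_phrase_words shows how longer phrases could be counted in the same way.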


Known Platforms

SpamProbe is known to compile and run on a wide range of *nix systems including Linux (RedHat and Debian), FreeBSD, Solaris, AIX, MacOS X, and Darwin. SpamProbe can also be compiled to run on Windows under the Cygwin environment. (If you compile and run SpamProbe on a system not mentioned here, please notify me so that I can add it to the list!)

SpamProbe requires another program to actually label and file spams. procmail and maildrop are popular choices for this purpose. The README file contains information about configuring procmail to work with SpamProbe.

If you want to use an ISAM database, you must have Peter Graf's PBL or Berkeley DB installed on your computer. Alternatively, you can use the custom hash file format (see the README.txt file for details). To use the image analysis feature, you will need to have libungif installed.


Licensing Terms

SpamProbe is open source software, and anyone is free to use it on their computers without any fees. The source code is distributed by Burton Computer Corporation under the terms of the Q Public License (QPL). Basically, the license says that you are free to use SpamProbe and redistribute it to others. You can even write programs using the SpamProbe source code as long as you make your own source code freely available to others under the QPL.


Support and Warranty

There is none! There is NO WARRANTY at all with this software. Read the QPL for details. YOU ASSUME ALL RISK when using this software.


Resources

  • The paper that started it all. Paul Graham's A Plan For Spam inspired dozens of developers to follow his lead and make people's inboxes readable again through Bayesian filtering.
  • Jem Berkes has developed a web interface for SpamProbe to allow users to access their spam and good corpus from their web browser. This could be useful for people who are using SpamProbe but not IMAP.
  • Dave Person has written an Emacs library, spamprobe.el, containing useful macros for use with SpamProbe.


Community

Be sure to visit the project page on SourceForge. There you can submit bug reports or feature requests, read and post messages on the forums, and download the latest version.

http://sourceforge.net/projects/spamprobe/

You can also join the SpamProbe mailing list to discuss issues with other SpamProbe users. Most of the serious discussion of SpamProbe takes place on the mailing list.

http://lists.sourceforge.net/lists/listinfo/spamprobe-users




Copyright © 2002-2005
Burton Computer Corporation