Laird A. Breyer
dbacl is a UNIX/POSIX command line toolset which can be used in scripts to classify a single email among one or more previously learned categories.
dbacl(1) supports several different statistical models and several different tokenization schemes, which can be adjusted to trade in speed and memory performance for statistical sophistication. dbacl(1) also permits the user to select cost weightings for different categories, thereby permitting simple adjustments to the type I and type II errors (a.k.a. false positives, etc.).
dbacl(1) is a general purpose text classifier which can understand email message formats. The tutorial explains general classification only. It is worth reading (of course :-), but doesn't describe the extra steps necessary to enable the email functionality. This document describes the necessary switches and caveats.
You can learn more about the dbacl suite of utilities (e.g. dbacl, bayesol, mailinspect, mailcross) by typing for example:
% man dbacl
A statistical email classifier cannot work unless you show it some (as many as you can) examples of emails for every category of interest. This requires some work, because you must separate your emails into dedicated folders. dbacl works best with mbox style folders, which are the standard UNIX folder type that most mailreaders can import and export.
We'll assume that you want to define two categories called notspam and spam respectively. If you want dbacl(1) to recognize these categories, please take a moment to create two mbox folders named (for example) $HOME/mail/spam and $HOME/mail/notspam. You must make sure that the $HOME/mail/notspam folder doesn't contain any unwanted messages, and similarly $HOME/mail/spam must not contain any wanted messages. If you mix messages in the two folders, dbacl(1) will be somewhat confused, and its classification accuracy will drop.
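Before learning, it can help to check how many messages each folder holds. The sketch below counts the lines beginning with "From " that separate messages in an mbox file. The function name count_messages is made up for this illustration and is not part of dbacl, and the count is only approximate if a message body contains an unquoted line starting with "From ".

```shell
# count_messages FILE: print the number of "From " separator lines in an
# mbox file. Hypothetical helper, not part of the dbacl suite.
count_messages() {
    grep -c '^From ' "$1"
}
# Example, assuming the two folders described above exist:
#   count_messages "$HOME/mail/spam"
#   count_messages "$HOME/mail/notspam"
```

If one folder is much smaller than the other, that category is probably the one that needs more examples.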
As time goes by, if you use dbacl(1) for classification, you will probably set up your filing system so that messages identified as spam go automatically into the $HOME/mail/spam folder, and messages identified as notspam go into the $HOME/mail/notspam folder. dbacl(1) is far from perfect, and can make mistakes. This will result in messages going to the wrong folder. When dbacl(1) relearns, it will become slightly confused and over time its ability to distinguish spam and notspam will be diminished.
As with all email classifiers which learn your email, you should inspect your folders regularly, and if you find messages in the wrong folder, you must move them to the correct folder before relearning. If you keep your mail folders clean for learning, dbacl(1) will eventually make very few mistakes, and you will have plenty of time to inspect the folders once in a while. Or so the theory goes...
To learn your spam category, go to the directory containing your $HOME/mail/spam folder (we assume mbox type here) and at the command prompt (written %), type:
% dbacl -T email -T xml -l spam $HOME/mail/spam
This reads all the messages in $HOME/mail/spam one at a time, ignoring certain mail headers and attachments, and also removes HTML markup. If you omit the -T xml option, dbacl(1) also learns the hidden HTML tags, but these often exist in good emails too and will sometimes reduce the effectiveness of classification.
If you get warning messages about the hash size being too small, you need to increase the memory reserved for email tokens. Type:
% dbacl -T email -T xml -H 20 -l spam $HOME/mail/spam
which reserves space for up to 2^20 (one million) unique words. The default dbacl(1) settings are chosen to limit dramatically the memory requirements (about 32 thousand tokens). Once the limit is reached, no new tokens are added and your category models will be strongly skewed towards the first few emails read. Heed the warnings.
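To pick a value for -H, you can work out the smallest power of two that covers the number of unique tokens you expect. The sketch below does this arithmetic in the shell. The function name hash_bits is hypothetical, and treating the default of "about 32 thousand" as exactly 2^15 = 32768 is an assumption.

```shell
# hash_bits N: print the smallest exponent n such that 2^n >= N.
# Hypothetical helper for choosing a -H value; not part of dbacl.
hash_bits() {
    n=0
    size=1
    while [ "$size" -lt "$1" ]; do
        size=$((size * 2))
        n=$((n + 1))
    done
    echo "$n"
}
hash_bits 32768      # prints 15
hash_bits 1000000    # prints 20, since 2^20 = 1048576
```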
If your email isn't kept in mbox format, you can list each email separately on the command line. For example, if your messages are stored in the directory $HOME/mh/, one file per email, you can type
% dbacl -T email -T xml -l spam $HOME/mh/*
Note however that you should be certain that only RFC822 messages are contained in this directory, and you might run into shell command line limitations if you have a very large number of emails. A better (and slightly faster) solution is to temporarily convert your mail into an mbox format file and use that for learning.
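One possible way to do the temporary conversion is sketched below. It is a simplified illustration, not a full mbox writer: the function name files_to_mbox is made up, the separator line uses a placeholder sender and date, and whether dbacl(1) is happy with such placeholder separators is an assumption you should check on a small sample first. The file names are expanded inside a shell loop, so the command line length limit is not an issue.

```shell
# files_to_mbox DIR: write every regular file in DIR to standard output as a
# single mbox stream. Hypothetical helper with simplified quoting.
files_to_mbox() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        # Separator line; the sender and date are placeholders.
        echo "From MAILER-DAEMON Thu Jan  1 00:00:00 1970"
        # Quote message lines that would otherwise look like separators.
        sed 's/^From />From /' "$f"
        # Blank line between messages.
        echo
    done
}
# Example, assuming one RFC822 message per file in $HOME/mh:
#   files_to_mbox "$HOME/mh" > /tmp/spam.mbox
#   dbacl -T email -T xml -l spam /tmp/spam.mbox
```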
It is not enough to learn the $HOME/mail/spam emails; you must also learn the $HOME/mail/notspam emails. dbacl(1) can only choose among the categories it learns. It cannot say that an email is unlike spam, only that an email is like spam or like notspam. To learn the notspam category, type:
% dbacl -T email -T xml -l notspam $HOME/mail/notspam
Make sure to use the same switches for both spam and notspam categories. Once you've fully read the man page later, you can start to mix and match switches.
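One way to guarantee that every category is learned with identical switches is to keep them in a single variable and loop over the categories. The sketch below assumes each category is stored in an mbox folder of the same name; the names learn_all, SWITCHES and DBACL are made up for this illustration.

```shell
# Switches shared by every category (the ones used in this document).
SWITCHES="-T email -T xml"
# DBACL can be overridden for a dry run, for example DBACL=echo.
DBACL=${DBACL:-dbacl}

# learn_all MAILDIR CATEGORY...: learn each category from MAILDIR/CATEGORY.
# Hypothetical helper; stops at the first command that fails.
learn_all() {
    dir=$1
    shift
    for category in "$@"; do
        # $SWITCHES is left unquoted on purpose so it splits into options.
        $DBACL $SWITCHES -l "$category" "$dir/$category" || return 1
    done
}
# Example: learn_all "$HOME/mail" spam notspam
```

Setting DBACL=echo before calling learn_all prints the arguments that would be passed instead of running the classifier, which is a cheap way to rehearse the loop.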
Every time dbacl(1) learns a category, it writes a binary file containing the statistical model into the hidden directory $HOME/.dbacl, so for example after learning the category spam, you will have a small file named $HOME/.dbacl/spam which contains everything dbacl(1) learned. The file is recreated from scratch each time you relearn spam, and is loaded each time you classify an email.
Suppose you have a file named email.rfc which contains the body of a single email (in standard RFC822 format). You can classify the email into either spam or notspam by typing:
% cat email.rfc | dbacl -T email -T xml -c spam -c notspam -v
notspam
All you get is the name of the best category; the email itself is consumed. If you would like to see scores for each category, type:
% cat email.rfc | dbacl -T email -T xml -c spam -c notspam -n
spam 232.07 notspam 229.44
The winning category always has the score closest to zero. In fact, the numbers returned with the -n switch are practically distances towards each category. If you prefer a return code, dbacl(1) returns a positive integer (1, 2, 3, ...) identifying the category by its position on the command line. So if you type:
% cat email.rfc | dbacl -T email -T xml -c spam -c notspam
then you get no output, but the return code is 2. If you use the bash(1) shell, the return code for the last run command is always in the variable $?.
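In a script, the return code can be turned back into a category name by looking up the same position in the list of categories. The sketch below shows one way to do this; category_name is a hypothetical helper, and how dbacl(1) reports errors through its return code is not covered here, so consult the man page before relying on it.

```shell
# category_name STATUS CATEGORY...: print the category found at position
# STATUS (counting from 1) in the list. Hypothetical helper.
category_name() {
    pos=$1
    shift
    # Reject positions outside 1..number of categories.
    if [ "$pos" -lt 1 ] || [ "$pos" -gt "$#" ]; then
        return 1
    fi
    shift $((pos - 1))
    echo "$1"
}
# Example, assuming email.rfc and both learned categories exist:
#   dbacl -T email -T xml -c spam -c notspam < email.rfc
#   category_name $? spam notspam
```

The categories must be listed in the same order as the -c switches on the dbacl command line.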
There is generally little point in running the commands above by hand, except if you want to understand how dbacl(1) operates, or want to experiment with switches.
Note, however, that simple scripts often do not check for error and warning messages on STDERR. It is always worth rehearsing the operations you intend to script, as dbacl(1) will let you know on STDERR if it encounters problems during learning. If you ignore warnings, you will likely end up with suboptimal classifications.
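A script can be made to notice such messages by capturing STDERR and treating any output there as a failure. The following is a minimal sketch under that assumption; run_checked is a hypothetical wrapper that works with any command, not something provided by dbacl.

```shell
# run_checked COMMAND [ARG...]: run a command, keep its standard error in a
# temporary file, and fail if the command failed or wrote anything there.
run_checked() {
    errfile=$(mktemp) || return 1
    "$@" 2> "$errfile"
    status=$?
    if [ -s "$errfile" ]; then
        # Show the messages and report a failure to the caller.
        cat "$errfile" >&2
        status=1
    fi
    rm -f "$errfile"
    return "$status"
}
# Example: run_checked dbacl -T email -T xml -l spam "$HOME/mail/spam"
```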
Once you are ready for spam filtering, you need to handle two issues.
The first issue is when and how to learn.
You should relearn your categories whenever you've received an appreciable number of emails. A category model normally doesn't change dramatically if you add a single new email (provided the original model depends on more than a handful of emails). The simplest strategy is a cron(1) job run once a day:
% crontab -l > existing_crontab.txt
Edit the file existing_crontab.txt with your favourite editor and add the following three lines at the end:
CATS=$HOME/.dbacl
5 0 * * * dbacl -T email -T xml -H 18 -l $CATS/spam $HOME/mail/spam
10 0 * * * dbacl -T email -T xml -H 18 -l $CATS/notspam $HOME/mail/notspam
Now you can install the new crontab file by typing
% crontab existing_crontab.txt
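If you would rather not edit the file by hand, the lines can be appended from the shell. The sketch below adds them only when no dbacl learning job is present yet, so running it twice does no harm. The helper name add_learning_jobs is made up, and the check for an existing job is a simple text match that you may need to adapt.

```shell
# add_learning_jobs FILE: append the daily learning jobs to a crontab file,
# unless a dbacl learning job is already present. Hypothetical helper.
add_learning_jobs() {
    if grep -q 'dbacl .*-l ' "$1" 2>/dev/null; then
        return 0
    fi
    # The quoted delimiter keeps $HOME and $CATS unexpanded in the file.
    cat >> "$1" <<'EOF'
CATS=$HOME/.dbacl
5 0 * * * dbacl -T email -T xml -H 18 -l $CATS/spam $HOME/mail/spam
10 0 * * * dbacl -T email -T xml -H 18 -l $CATS/notspam $HOME/mail/notspam
EOF
}
# Example:
#   crontab -l > existing_crontab.txt
#   add_learning_jobs existing_crontab.txt
#   crontab existing_crontab.txt
```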
The second issue is how to invoke and what to do with the dbacl classification.
Many UNIX systems offer procmail(1) for email filtering. procmail(1) can pipe a copy of each incoming email into dbacl(1), and use the resulting category name to write the message directly to the appropriate mailbox.
To use procmail, first verify that the file $HOME/.forward exists and contains the single line:
|/usr/bin/procmail
Next, create the file $HOME/.procmailrc and make sure it contains something like this:
PATH=/bin:/usr/bin:/usr/local/bin
SHELL=/bin/bash
MAILDIR=$HOME/mail
DEFAULT=$MAILDIR/inbox

#
# this line runs the spam classifier
#
:0 c
YAY=| dbacl -vT email -T xml -c $HOME/.dbacl/spam -c $HOME/.dbacl/notspam

#
# this line writes the email to your mail directory
#
:0:
* ? test -n "$YAY"
# if you prefer to write the spam status in a header,
# comment out the first line and uncomment the second
$MAILDIR/$YAY
#| formail -A "X-DBACL-Says: $YAY" >>$DEFAULT

#
# last rule: put mail into mailbox
#
:0:
$DEFAULT
The above script will automatically file your incoming email into one of two folders named $HOME/mail/spam and $HOME/mail/notspam respectively. If you have a POP account and your mailreader contacts your ISP directly, this won't work; try using fetchmail(1).
The classification performed by dbacl(1) as described above is known as a MAP estimate. The optimal category is chosen only by looking at the email contents. What is missing is your input as to the costs of misclassifications.
To understand the idea, imagine that an email being wrongly marked spam is likely to be sitting in the $HOME/mail/spam folder until you check through it, while an email wrongly marked notspam will prominently appear among your regular correspondence. For most people, the former case can mean a missed timely communication, while the latter case is merely an annoyance.
No classification system is perfect. Learned emails can only imperfectly predict never before seen emails. Statistical models vary in quality. If you try to lower one kind of error, you automatically increase the other kind.
The dbacl system allows you to specify how much you hate each type of misclassification, and does its best to accommodate this extra information. To input your settings, you will need a risk specification like this:
categories {
    spam, notspam
}
prior {
    1, 1
}
loss_matrix {
"" spam      [ 0, 1^complexity ]
"" notspam   [ 2^complexity, 0 ]
}
This risk specification states that your cost for misclassifying spam emails into notspam is 1 for every word of the email (merely an annoyance). Your cost for misclassifying regular emails into spam is 2 for every word of the email (a more serious problem). The costs for classifying your email correctly are zero in each case. Note that the cost numbers are arbitrary; only their relative sizes matter. See the tutorial if you want to understand these statements.
Now save your risk specification above into a file named my.risk, and type
% cat email.rfc | dbacl -T email -T xml -c spam -c notspam \
  -vna | bayesol -c my.risk -v
notspam
The output category may or may not differ from the category selected via dbacl(1) alone, but over many emails, the resulting classifications will be more cautious about marking an email as spam.
Since dbacl(1) can output the score for each category (using the -n switch), you are also free to do your own processing and decision calculation, without using bayesol(1). For example, you could use:
% cat email.rfc | dbacl -T email -T xml -n -c spam -c notspam | \
  awk '{ if($2 * p1 * u12 > $4 * (1 - p1) * u21) { print $1; } \
       else { print $3; } }'
where p1 is the a priori probability that an email is spam, u12 is the cost of misclassifying spam as notspam (seeing spam among your regular email), and u21 is the cost of misclassifying notspam as spam.
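The awk command above leaves p1, u12 and u21 for you to fill in. A minimal sketch that passes them in with awk's -v option follows. The function name decide and the values 0.5, 1 and 2 are placeholders chosen for illustration, and the sample input is the -n output shown earlier in this document, so dbacl(1) itself is not run here.

```shell
# decide P1 U12 U21: read one line of "dbacl -n" output in the form
# "name score name score" and apply the comparison from the command above.
# Hypothetical helper; the three arguments are your own choices.
decide() {
    awk -v p1="$1" -v u12="$2" -v u21="$3" '{
        if ($2 * p1 * u12 > $4 * (1 - p1) * u21) { print $1 }
        else { print $3 }
    }'
}
# 232.07 * 0.5 * 1 = 116.035 is not greater than 229.44 * 0.5 * 2 = 229.44,
# so this prints "notspam".
echo "spam 232.07 notspam 229.44" | decide 0.5 1 2
```

In real use, the echo would be replaced by the dbacl -n pipeline from the command above.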
When dbacl(1) inspects an email message, it only looks at certain words/tokens. In all examples so far, the tokens picked up were purely alphabetic words. Numbers, punctuation, and special characters such as $, @ and % are not picked up.
The success of text classification schemes depends not only on the statistical models used, but also strongly on the type of tokens considered. dbacl(1) allows you to try out different tokenization schemes. What works best depends on your email.
By default, dbacl(1) picks up only purely alphabetic words as tokens (this uses the least amount of memory). To pick up alphanumeric tokens, use the -e switch as follows:
% dbacl -e alnum -T email -T xml -l spam $HOME/mail/spam
% dbacl -e alnum -T email -T xml -l notspam $HOME/mail/notspam
% cat email.rfc | dbacl -T email -T xml -c spam -c notspam -v
notspam
You can also pick up printable words (use -e graph) or purely ASCII (use -e ascii) tokens. Note that you do not need to indicate the -e switch when classifying, but you should make sure that all the categories use the same -e switch when learning.
dbacl(1) can also look at single words, consecutive pairs of words, triples, quadruples. For example, a trigram model based on alphanumeric tokens can be learned as follows:
% dbacl -e alnum -w 3 -T email -T xml -l spam $HOME/mail/spam
One thing to watch out for is that n-gram models require much more memory to learn in general. You will likely need to use the -H switch to reserve enough space.
If you prefer, you can specify the tokens to look at through a regular expression. The following example picks up single words which contain purely alphabetic characters followed by zero or more numeric characters. It can be considered an intermediate tokenization scheme between -e alpha and -e alnum:
% dbacl -T email -T xml \
  -g '(^|[^[:alpha:]])([[:alpha:]]+[[:digit:]]*)||2' \
  -l spam $HOME/mail/spam
% dbacl -T email -T xml \
  -g '(^|[^[:alpha:]])([[:alpha:]]+[[:digit:]]*)||2' \
  -l notspam $HOME/mail/notspam
% cat email.rfc | dbacl -T email -T xml -c spam -c notspam -v
notspam
Note that there is no need to repeat the -g switch when classifying.
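To get a feel for what such a regular expression picks up, you can try its inner part on a sample line with grep(1). The sketch below only imitates the pattern; it does not reproduce how dbacl(1) itself tokenizes, and it relies on the -o option, which is available in GNU and BSD grep but is not required by POSIX. The function name tokens and the sample sentence are made up.

```shell
# tokens: print one candidate token per line, using the second parenthesized
# part of the regular expression above (letters followed by optional digits).
# This imitates the pattern only, not dbacl itself.
tokens() {
    grep -oE '[[:alpha:]]+[[:digit:]]*'
}
# Prints Win, dollars, at, casino24 and now, one per line; "1000" is
# skipped because it contains no alphabetic character.
echo "Win 1000 dollars at casino24 now" | tokens
```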