NB: Bogotune is a tool (shipped with bogofilter) that automates the tuning process. Its "full search" mode performs a five-dimensional grid search over possible values of the parameters to be described below, and comes up with recommendations for optimal settings. There's also a "partial search" mode that is only three-dimensional. If you have enough spam and nonspam messages (at least 2,500 of each), using bogotune is highly recommended for optimizing bogofilter's accuracy.
Each of Gary Robinson's proposed classification methods works better than the earlier versions. For optimal results, they (and the original) require some tuning. As distributed, bogofilter attempts to supply good starting values for the tunable parameters. Since the optimum values depend on the size and content of the wordlists at your installation, the best results can only be determined by some experimentation using your wordlists. After several thousand each of spam and nonspam messages have been fed to bogofilter for training, this experimentation can be well worthwhile: you may be able to cut the number of spams that are still getting through by more than half.
The purpose of this document is to explain how to adjust bogofilter's parameters for best spam filtering. A manual tuning process is described; though you'll be wise to use bogotune to help find your parameter values, understanding the manual process will help you make sense of what bogotune does.
With Robinson's changes as implemented in bogofilter, there are seven (five without effective size factors) things to tune, six (or four) of which are highly interdependent, as explained below: the unknown-token value x, the weighting factor s, the minimum deviation (MIN_DEV), the spam and nonspam cutoffs, and, if used, the two effective size factors.
The way bogofilter works, summarizing briefly, is that the message being classified is separated into "tokens" -- words, IP addresses and other logical units of information. Each token is looked up in the wordlist that makes up the training database. The number of times it's been seen in a spam message is divided by the total number of times it's been seen, and that gives an indication of the likelihood that the token is in a spam message. The likelihood estimates for all the tokens are combined to give a score between 0 and 1 -- 0 means the message is not likely to be a spam, 1 means it is. That's fine, but what happens if a token's never been seen before? That's where x comes in: it's a "first guess" at what the presence of an unknown token means, in terms of its contribution to the score. It's the value used as the likelihood estimate when a new token is found.
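To make the idea concrete, here is a minimal Python sketch (not bogofilter's actual code) of the per-token estimate just described; the "counts" dictionary and the value 0.52 used for x are invented for the example:

    # Illustrative sketch only, not bogofilter's implementation.
    # "counts" is a hypothetical mapping from token to (spamcount, totalcount);
    # x is the first-guess value used when a token has never been seen before.
    def token_probability(token, counts, x):
        if token not in counts:
            return x                       # unknown token: fall back on x
        spamcount, totalcount = counts[token]
        return spamcount / totalcount      # fraction of sightings that were in spam

    counts = {"viagra": (40, 42), "meeting": (1, 70)}
    print(token_probability("viagra", counts, 0.52))        # about 0.95
    print(token_probability("never-seen", counts, 0.52))    # 0.52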
Obtaining a value for x is easy: bogoutil will calculate it for you. Assuming your bogofilter wordlist is in ~/.bogofilter, run

bogoutil -r ~/.bogofilter

This will print out an x value. If it's in the range of 0.4 to 0.6, you can run

bogoutil -R ~/.bogofilter

to install the calculated value so bogofilter will use it from then on.
The value of x that bogoutil calculates is just the average of
p(w) = badcount / (goodcount + badcount)
To be honest, that's an oversimplification, for the sake of explaining the basic concept clearly. In real life, you have to scale the counts somehow. If you had exactly the same number of spam and nonspam messages contributing to your wordlist, the formula for p(w) given above would be ok; but we actually have to use
p(w) = (badcount / badlist_msgcount) / (badcount / badlist_msgcount + goodcount / goodlist_msgcount)
An equivalent way of writing that p(w) formula, a little easier to read, is:
scalefactor = badlist_msgcount / goodlist_msgcount
p(w) = badcount / (badcount + goodcount * scalefactor)
In either case, x is the average of those p(w) values.
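To illustrate the calculation (a sketch only; bogoutil does this directly from the wordlist database), the following Python computes x from a handful of invented token and message counts:

    # Sketch of the x calculation described above; all the counts below
    # are invented for illustration.
    tokens = {                       # token: (badcount, goodcount)
        "viagra":  (40, 2),
        "meeting": (1, 70),
        "http":    (55, 60),
    }
    badlist_msgcount = 1000
    goodlist_msgcount = 1250

    scalefactor = badlist_msgcount / goodlist_msgcount
    pw = [bad / (bad + good * scalefactor) for bad, good in tokens.values()]
    x = sum(pw) / len(pw)            # x is the average of the p(w) values
    print(round(x, 3))               # about 0.504 for these made-up numbers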
The calculated x is just a first guess, and it may be worthwhile experimenting (after tuning s and the minimum deviation as described below) to see what happens if you adjust it upward or downward within a range of about 0.1 either way.
When a token has been seen only a few times, its p(w) is unreliable, so bogofilter compromises between p(w) and the first guess x: a parameter s is defined that serves as a weighting factor; the larger the value of s, the more importance is given to x in the presence of low token counts. The token count is
n = badcount + goodcount
and the p(w) value for the token is modified as follows:
f(w) = (s * x + n * p(w))/(s + n)
Parameters s and x are only important when the counts are small, and the value of s reflects what we think of as "small." If s is large, then when counts are small we trust our x value more than we do the p(w); if s is small, we give more weight to p(w) and less to x.
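As a worked example with made-up numbers: suppose x = 0.5 and a token has been seen once in spam and never in nonspam, so p(w) = 1 and n = 1. With s = 1,

f(w) = (1 * 0.5 + 1 * 1) / (1 + 1) = 0.75

but with s = 0.01,

f(w) = (0.01 * 0.5 + 1 * 1) / (0.01 + 1) = 0.995

so the smaller s is, the more that single sighting is allowed to dominate.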
So how big should s be? Gary Robinson suggested, on a theoretical basis, that we start with a value of 1; it might be worth trying values in the range of 0.01 to 10, though I've never had good results with values larger than 1. Making s smaller than 0.01 or so is a bad idea, because of what happens when a token is encountered that's been seen in spam before, but never in nonspam, or vice versa. In that case, p(w) is exactly 1 or 0, and f(w) will vary greatly as the value of s diminishes. As an example, one spam that had about ten such tokens, out of 78 that contributed to the spam score, scored 0.999 with s set to 0.001, and was found to score 0.505 when s was 1.0e-8!
Choosing a value for s is a matter of trial and error. Experience suggests that 0.1 might be a better starting value than 1, at least if your training database is moderately large.
Note that higher values of MIN_DEV (the minimum deviation: tokens whose f(w) falls within MIN_DEV of 0.5 are ignored when the score is calculated) accentuate the distortion caused by small s, because tokens appearing in only one of the spam and nonspam counts will (if present) be a higher proportion of the total number of tokens considered.
Bogofilter combines the f(w) values of the N tokens that survive the MIN_DEV test into a single score S, using Fisher's method:

P = prbx(-2 * sum(ln(1 - f(w))), 2*N)
Q = prbx(-2 * sum(ln(f(w))), 2*N)
S = (1 + Q - P) / 2

where prbx(v, df) is the probability that a chi-square variate with df degrees of freedom would be as large as v or larger. When the effective size factors y and z are in use, the calculation becomes

P = prbx(-2 * ln(prod(1 - f(w))^y), 2*N*y)
Q = prbx(-2 * ln(prod(f(w))^z), 2*N*z)
S = Q / (Q + P)
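For readers who want to see the arithmetic spelled out, here is a small Python sketch of the basic (non-ESF) combination; the f(w) values are invented, and the chi-square probability uses the closed form that holds for even degrees of freedom:

    # Sketch of the combination described above (without effective size factors).
    import math

    def prbx(v, df):
        # Upper-tail chi-square probability for even df: the chance that a
        # chi-square variate with df degrees of freedom is >= v.
        m = v / 2.0
        term = math.exp(-m)
        total = term
        for i in range(1, df // 2):
            term *= m / i
            total += term
        return min(total, 1.0)

    fw = [0.99, 0.97, 0.90, 0.30, 0.95]   # invented f(w) values for the tokens
    N = len(fw)
    P = prbx(-2 * sum(math.log(1 - f) for f in fw), 2 * N)
    Q = prbx(-2 * sum(math.log(f) for f in fw), 2 * N)
    S = (1 + Q - P) / 2                    # final score between 0 and 1
    print(round(S, 3))                     # about 0.992 for these values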
Determining the correct values to use for y and z is, like choosing the value for s, a matter of trial and error. A significant corpus of messages is required and users who don't have access to large collections of their spam and nonspam messages should probably opt to do without ESF. Useful values to try seem to be in the range of 0.75 raised to powers between 1 and 20.
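For concreteness, that range runs from 0.75 (0.75 to the first power) down to about 0.003 (0.75 to the twentieth); intermediate powers give, for example, 0.75^5 = 0.24 and 0.75^10 = 0.056 (rounded).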
The best value for the spam cutoff depends strongly on the values of the x, s, MIN_DEV and ESF parameters that are being used in the calculation. For that reason, the way to test a given parameter set is to score a large collection of your own spam and nonspam messages with it, set the spam cutoff just high enough that no more than your target number of nonspams score at or above it (false positives), and then count the spams that score below that cutoff (false negatives): the fewer false negatives at a given false-positive target, the better the parameter set.
The value for the nonspam threshold should be such that no more than about one in ten thousand spams is classified as nonspam. A value of 0.2 to 0.25 might be a good starting point for use with the recommended s and MIN_DEV (0.1 and 0.35 respectively).
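By way of illustration, here is a rough Python sketch of the testing procedure just described; the score lists, the 0.5% false-positive target and the function name are all invented for the example (bogotune automates this much more thoroughly):

    # Sketch of testing one parameter set: pick the spam cutoff that meets a
    # false-positive target on your test nonspams, then count missed spams.
    def evaluate(spam_scores, nonspam_scores, fp_target=0.005):
        allowed_fp = int(fp_target * len(nonspam_scores))
        ranked = sorted(nonspam_scores, reverse=True)
        # Cutoff just above the highest nonspam score we are not allowed to
        # misclassify; messages scoring >= cutoff would be called spam.
        cutoff = ranked[allowed_fp] + 1e-9
        false_pos = sum(1 for sc in nonspam_scores if sc >= cutoff)
        false_neg = sum(1 for sc in spam_scores if sc < cutoff)  # spams missed
        return cutoff, false_pos, false_neg

    spam_scores = [0.99, 0.97, 0.62, 0.88, 0.995]    # invented example scores
    nonspam_scores = [0.01, 0.12, 0.33, 0.05, 0.002]
    print(evaluate(spam_scores, nonspam_scores))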
With these preliminaries completed, bogotune performs two scans: The first is a coarse grid search over the entire useful range of each parameter, which should locate an approximate optimum. A finer grid, centered on the optimum derived from the coarse search, is then scanned to produce the final recommendations. At the end of each scan, outliers -- apparently good parameter sets from which even slight deviation degrades performance significantly -- are rejected. Bogotune usually manages to find a robust parameter set that gives good discrimination between spam and nonspam messages, provided that sufficient training and test messages are supplied.
There are two reasons why one might want to force a bogotune run to use a specific false-positive target. One is that sometimes bogotune's target calculation algorithm is overly optimistic, i.e. it occasionally sets the target too low. The symptom of this problem is that many parameter combinations in the coarse grid scan simply can't deliver that few false positives from the test message corpus. Manually increasing the target by 20 to 30 percent usually fixes this. The other reason to coerce the target to a specific value is that one might want to compare two bogotune runs -- with and without ESF, for example -- and comparisons aren't valid unless both runs use the same target.
Bogotune may produce very different recommendations for very similar sets of spam and nonspam messages. That's not necessarily a defect. For many message populations, the scoring algorithm doesn't depend heavily on precise values of any of the parameters but the spam cutoff. In such cases it's common to see bogotune's coarse scan settle on one of several local optima that may have quite different parameter values. In general, the parameter values have more influence when a single training database is used for a large number of users, and are less crucial when the wordlist is for just one or a few users.
How often to tune the other parameters depends on how fussy you are about optimizing performance. If you're eager to get the best discrimination you can, here are my recommendations: In the early stage, while the training database is still small and growing rapidly, it's probably wise to experiment with tuning x, s and MIN_DEV once a month or so. Once the training database is a good size (over 5000 spams and 5000 nonspams), this can be reduced to quarterly or half-yearly. If you use ESF (which I recommend you do), retuning should probably happen a bit more often, especially if you see that the characteristics of spam you're receiving seem to be changing.
FWIW my own practice, after two years' experience, is to review the spam cutoff monthly and do a bogotune run about quarterly.
When starting afresh, feed every spam and every nonspam you get into bogofilter. Do not use bogofilter's -u option to do this: there will be far too many errors when your training database is small. Instead, classify messages manually and train bogofilter with the -n and -s options appropriately. You can do it in batches: if you work with standard mbox files, use a mail reader to move spam and nonspam into separate files, then do

bogofilter -s < spambox

and

bogofilter -n < nonspambox

to register the messages.
To find out how many spam and nonspam messages have gone into your wordlist, assuming it's kept in ~/.bogofilter, do
bogoutil -w ~/.bogofilter .MSG_COUNT
Once you've accumulated about 5,000 spam or nonspam messages in the list, you need to let the other count catch up unless they're growing at about the same rate. Stop adding every message to the larger of the two counts, and instead, add only messages that bogofilter got wrong or was unsure about. To do this, you need to start classifying messages into 4 sets instead of 2: spam, nonspam, unsures that were actually spam, and unsures that were nonspam.
Once the counts are both over 5,000 and fairly similar, you can train only on errors and unsures. By this time there should be very few errors (spams classed as nonspam or vice versa), but there will still be a proportion of unsure spam and unsure nonspam messages. Training on these will keep bogofilter working well, as you're telling it what it needs to learn, and not so much of what it already knows. If one list grows faster than the other, extra (correctly classified) messages may be added from time to time to equalize them again; try to keep the smaller list's message count at least two thirds of the larger's.
The use of bogofilter's -u option is convenient but dangerous. Even when well trained, bogofilter will misclassify a small number of messages. With the -u option, these mistakes are entered into the database and decrease its accuracy. More mistakes result and the process snowballs. You therefore need to review and correct the databases frequently, and in the interval between such reviews, bogofilter's effectiveness will keep falling off. The method described above has the advantage that (human error excepted) no wrongly classified messages are ever entered into the database, and if you have to leave it for a time without updating it, its effectiveness doesn't diminish.
Much of the advice given here arises out of experiments reported on the author's bogofilter web pages; in particular, the report at www.bgl.nu/bogofilter/smindev3.html might interest those who'd like more information about the basis of bogofilter tuning.
Thanks go to David Relson for reviewing this document in draft and suggesting several improvements.
[© Greg Louis, 2004; last modified 2004-09-09]