# Homework: Bayesian Predictive Spam Filtering

Bart Massey
Due Monday before class

1. In spam filtering, you look at evidence that a message is spam in order to try to decide whether it is spam or "ham". Two things matter: how effective your test is, and how likely it is that the various messages are spam to begin with.

In this exercise, we will imagine that spam filtering is to be performed by measuring a single feature: in this case, whether there is an exclamation point in the message.

2.1. Construct a Python program (gen-email.py) that prints randomly-generated lines. Each line should contain first a 0 if a hypothetical message is "ham" or a 1 if it is spam, then a 0 if the message contains no exclamation point or a 1 otherwise.

You should be able to easily specify parameters of your generator: number of lines, probability that a given line is spam, probability that a ham line contains an exclamation point, probability that a spam line contains an exclamation point.

For example, if you set n = 10, frac_ham = 0.66, frac_ham_excl = 0.2, frac_spam_excl = 0.9, you might see an output like:

       1 1
0 0
0 0
1 0
0 0
0 0
0 1
0 0
0 0
1 1


2.2. Now, build a Python program (spam-bayes.py) that reads data in this format and produces a table estimating the probabilities that a future message m is ham or spam based on whether an exclamation point is or is not present in m.

To do this, you will need to count the number of ham and spam messages (as described by the first field of lines in the file), the number of messages with and without an exclamation point (as described by the second field of lines in the file), and separately the number of exclamation points in ham and spam messages. Then you can compute, for example:

       Pr(S|E) = Pr(E|S) Pr(S) / Pr(E)


For the sample data above, your output should be a table that looks something like:

           N    E
H 0.86 0.33
S 0.14 0.67


2.3. Note that in our example the presence of an exclamation point isn't a very good test: it will likely predict ham on 14% of spam messages and spam on 33% of ham messages. Vary the parameters of your generator and notice how the performance improves or worsens.

There are ways to improve the performance of the Bayesian approach in spam filtering, most notably by using more than one feature per message. See Section 4.2.2 of this paper for one way to extend the Bayesian approach to multiple features.

Turn in this assignment using Crossgrade.