Department of Computer Science and Engineering CSE 150
University of California at San Diego Fall 2004

Assignment 3

DUE WEDNESDAY NOVEMBER 17, 2004, AT 5 PM.



The task here is to build a spam filter using Bayesian learning.  You should implement the naive Bayes learning algorithm (NB) described in class, and apply it to learn a classifier that distinguishes spam from legitimate email messages as accurately as possible.

You must write a preprocessor that converts an email message into a fixed-length vector of feature values.  Use your creativity in choosing features that you think are likely to be predictive, i.e. likely to be different for the two classes of messages.  Use at least 100 different features, and make sure your NB implementation can handle at least 200 features and 2000 training examples.  Most of your features are likely to be words, but also use other features such as the time of day of the message.  Use human intelligence to avoid features that may be highly predictive for the training data, but that will not be useful for test data, for example the date of the email message.

Here is a training set of about 500 spam messages and here is a training set of about 500 legitimate messages (links removed).  You will have to divide each file into its separate email messages; this is not difficult.  You must also write a script to preprocess each message, to compute the value for it of each feature that you use.

On Friday November 12, we will publish a test set of about 500 mixed spam and legitimate email messages.  You will run your classifier to generate a file of predicted labels: 1 for spam and 0 for legitimate.  We will compute the percentage of correct labels that you submit, and part of your score will be based on this percentage.  Note that for this assignment, both types of mistake (classifying spam as legitimate, and vice versa) are equally bad.

Your report should include discussions of the following issues:
In your report, you should also describe the classifier that your software learns, numerically and qualitatively.  Which features (and which values of which features) are most predictive of a message being spam?  How many features contribute significantly to classification, versus how many are uninformative?  What features might you want to add to your feature set in the future, to get a better classifier?  Is it adequate to represent a message as a "bag of words," or does this representation lose too much information?