Assignment 3 Grading Guide
CSE 150 - Fall 2004

There are five parts to this assignment:

  1. Writing an email preprocessor.
  2. Implementing the naive Bayes learning algorithm.
  3. Implementing a naive Bayes classifier.
  4. Testing your classifier.
  5. Writing the report.

1. Write a Preprocessor.

You must write a preprocessor function that inputs text emails, computes the values of chosen features for these emails, and outputs the resulting vectors of feature values. If you like, the preprocessor may be written in a scripting language such as Python or Perl. This function will be used by both the learning and classifying algorithms. You should not have to replicate this code in both programs. In designing the preprocessor, you must choose which features to use. In your report, describe and justify your method for choosing features. You may use any interface you like, but be sure to describe this interface in your README.
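As an illustration of the preprocessor's role, here is a minimal sketch of one possible design: a bag-of-words preprocessor with binary features. It is not a required design. The function names (tokenize, build_vocabulary, featurize), the token pattern, and the max_features cutoff are all assumptions made for this example, and choosing words by document frequency is only one of many feature selection methods you could justify in your report.

```python
import re
from collections import Counter

# Hypothetical token pattern: runs of lowercase letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")

def tokenize(email_text):
    """Split raw email text into lowercase word tokens."""
    return TOKEN_RE.findall(email_text.lower())

def build_vocabulary(emails, max_features=1000):
    """Pick the max_features words that occur in the most training emails."""
    doc_freq = Counter()
    for text in emails:
        # Count each word once per email (document frequency).
        doc_freq.update(set(tokenize(text)))
    # Sort by descending document frequency, then alphabetically for determinism.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:max_features]]

def featurize(email_text, vocabulary):
    """Return a binary vector: 1 if the vocabulary word occurs in the email."""
    present = set(tokenize(email_text))
    return [1 if word in present else 0 for word in vocabulary]
```

Because both the learner and the classifier would call featurize with the same vocabulary, the feature code lives in one place and does not need to be duplicated.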

2. Implement the naive Bayes Learner.

You must write a program that implements the naive Bayes learning algorithm. This program should input the set of labeled training text emails (as opposed to feature vectors) and estimate the necessary statistics. It should then output these statistics to a file to be used by the naive Bayes classifier. Use whatever input and output interfaces you like. In the README header, specify the exact command needed to run your learner on the spam and legit training emails. For instance, if the command to run your learner is:

spamNBLearner legit spam > spam_params.out

please specify this in the README header.
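The statistics the learner must estimate can be sketched as follows, assuming binary feature vectors and the label convention used in this guide (1 for spam, 0 for legit). The function names, the add-alpha (Laplace) smoothing constant, and the JSON layout of the parameter file are assumptions for this example; any format your classifier can read back is acceptable.

```python
import json

def train_naive_bayes(vectors, labels, alpha=1.0):
    """Estimate class priors and P(feature = 1 | class) from binary vectors.

    labels: 1 for spam, 0 for legit.
    alpha: add-alpha smoothing constant (an assumption of this sketch); it
    keeps every estimated probability strictly between 0 and 1.
    Assumes both classes appear at least once in the training data.
    """
    n_features = len(vectors[0])
    params = {"prior": {}, "likelihood": {}}
    for cls in (0, 1):
        rows = [v for v, y in zip(vectors, labels) if y == cls]
        params["prior"][str(cls)] = len(rows) / len(vectors)
        params["likelihood"][str(cls)] = [
            (sum(row[j] for row in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
    return params

def save_params(params, path):
    """Write the learned statistics to a file for the classifier to load."""
    with open(path, "w") as f:
        json.dump(params, f)
```

Without smoothing, a word that never appears in the spam training emails would get probability zero and veto the spam class for any email containing it, which is why some form of smoothing is commonly used.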

3. Implement the naive Bayes Classifier.

You must write a program that implements the naive Bayes classifier algorithm. This program should take two arguments: 1) the file of parameters for your classifier (the output of your learner), and 2) the name of a file consisting of text emails strung together. The emails will be novel spam and legit emails. Using the statistics learned by the naive Bayes learner, it should output to stdout the estimated class of each input email: 1 for spam and 0 for legitimate. Each class should be on a separate line. If the name of your classifier executable is spamNBClassifier, then the command to run your classifier on the email messages in the file test would be:

spamNBClassifier spam_params.out test

Please specify the name of the classifier executable in the README header. The expected output should be something like:
1
1
0
1
0
...
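The decision rule itself can be sketched as below, assuming binary feature vectors and a parameter dictionary with "prior" and "likelihood" entries keyed by class label. That layout, the function name classify, and the example numbers are hypothetical and chosen only for illustration. The sketch sums log probabilities instead of multiplying probabilities, which avoids numerical underflow when there are many features.

```python
import math

def classify(vector, params):
    """Return 1 (spam) or 0 (legit) for one binary feature vector."""
    scores = {}
    for cls in ("0", "1"):
        # Start from the log prior, then add one log likelihood per feature.
        score = math.log(params["prior"][cls])
        for x, p in zip(vector, params["likelihood"][cls]):
            score += math.log(p if x == 1 else 1.0 - p)
        scores[cls] = score
    # Ties are broken in favor of legit (0) in this sketch.
    return 1 if scores["1"] > scores["0"] else 0

# Made-up parameters for two features, for illustration only.
example_params = {
    "prior": {"0": 0.5, "1": 0.5},
    "likelihood": {"0": [0.25, 0.5], "1": [0.75, 0.5]},
}

# Print one class per line, matching the expected output format.
for vec in ([1, 0], [0, 1]):
    print(classify(vec, example_params))
```

With these made-up parameters the first vector is classified as spam and the second as legit, so the loop prints 1 and then 0, each on its own line.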

4. Test your classifier.

You are to test your classifier on each email of this set of unlabeled test emails. You should store the results of your classifier in the file test_results.out. Part of your grade will be based on the classification error of your algorithm on this test set. You may not hand label this test set and use it in your learning algorithm. Also, you may not look through the test set to find 'good features'. Using the test set in any way to enhance the performance of your classifier and then reporting these enhanced results is an ethical problem, because it violates the nature of the Scientific Method.

The test file consists of 1200 emails, of which approximately half are spam and half are legit. Your test_results.out file should be exactly 1200 lines long, where each line contains either a '0' or a '1'. During grading, we will check your results using the unix command:

$ cmp -l true_labels.out test_results.out | grep -c ""

where true_labels.out is our key and test_results.out is your "best guess" at the true labels. Here is a random_results.out file which may be helpful for testing whether your output file format is correct. Note that this file will only correctly classify about 600 of the 1200 emails.
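Since cmp -l reports every differing byte, stray whitespace or a missing line in your results file would be counted as errors. The following sketch shows one way to check the format before submitting; the function names are hypothetical, and it assumes the file's lines have been read with the newlines removed.

```python
def check_results_format(lines, expected_count=1200):
    """Return a list of problems found in a results file's lines."""
    problems = []
    if len(lines) != expected_count:
        problems.append(f"expected {expected_count} lines, found {len(lines)}")
    for i, line in enumerate(lines, start=1):
        # Strict check: any extra whitespace counts as a format error.
        if line not in ("0", "1"):
            problems.append(f"line {i}: expected '0' or '1', got {line!r}")
    return problems

def count_mismatches(predicted, truth):
    """Count positions where two equal-length label lists disagree."""
    return sum(1 for p, t in zip(predicted, truth) if p != t)
```

For example, after reading test_results.out with open("test_results.out").read().splitlines(), an empty list returned by check_results_format means the file has the expected shape. You can use count_mismatches on your own labeled training data, but not on the test set, whose labels you do not have.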

5. Write the Report.

Your report should address the issues raised in the Assignment 3 handout. You should also describe the classifier that your software learns, numerically and qualitatively. Which features (and which values of which features) are most predictive of a message being spam? How many features contribute significantly to classification, versus how many are uninformative? What features might you want to add to your feature set in the future, to get a better classifier? Is it adequate to represent a message as a "bag of words," or does this lose too much information?

It may be helpful to review the grading guide below.

README Header:

Your README file should have the following header:

Partner 1 Name: [fullname of one partner]
Partner 2 Name: [fullname of other partner]
Partner 1 Login: [login of one partner]
Partner 2 Login: [login of other partner]
NB Learner Command: [full command to run your NB learner]
NB Classifier Executable Name: [name of the command to run your classifier]

Here is an example README header:

Partner 1 Name: Douglas Turnbull  
Partner 2 Name: Kristin Branson
Partner 1 Login: dturnbul
Partner 2 Login: kbranson
NB Learner Command: spamNBLearner legit spam > spam_params.out
NB Classifier Executable Name: spamNBClassifier

Your programs should compile with the command make.

Here is the grading guide for assignment 3:


CSE 150 - Grading Form for Assignment 3
November 17, 2004

This project will be graded as follows:

    * 1/3 for your report               :     /50
    * 1/3 for your experimental results :     /50
    * 1/3 for your code                 :     /50

    * Total                             :     /150
    * Lateness Penalty                  :     %
    * Final Score                       :     /150

---------------------------------------------------------------------
1. Completeness and quality of report [50 pts]:

-[10] Introduction (Data Mining, SPAM Filtering Problem):

-[10] Naive Bayes Classifier Description:
      - [5] Description (how/why it works):
      - [5] Evaluation (Strengths + Weaknesses of NB):
     
-[10] Features:
      - [5] Features Description (bag of words, others):                   
      - [5] Justification (predictive, alternatives)  :

-[10] Conclusion (Lessons Learned):

-[10] Overall Quality of Report - Spelling, Format, Comprehensibility:

---------------------------------------------------------------------
2. Correctness and completeness of your experimental results [50 pts]:

[20] Reported Expected Performance (Cross-Validation Results):

[30] Performance on Test Set :
     - 1200 is perfect, ~600 is no better than random guessing


--------------------------------------------------------------------
3. Correctness and quality of your code [50 pts]:

- [3] README  :
- [3] Makefile or single computation :
- [3] Programs compile without errors :
- [8] Preprocessor :
- [10] Naive Bayes Learner  :   
- [10] Naive Bayes Classifier :
- [10] Performance on Small Test Set  :
- [3] Execution Speed and User Interface:

-----------------------------------------------------------------
Additional Comments:

Please contact Douglas Turnbull (dturnbul@cs.ucsd.edu) if you have any comments or concerns.