Assignment 3 Grading Guide
CSE 150 - Fall 2004
There are five parts to this assignment:
1. Write a Preprocessor.
You must write a preprocessor function that takes text emails as input, computes the values of the chosen features for each email, and outputs the resulting vectors of feature values. If you like, the preprocessor may be written in a scripting language such as Python or Perl. This function will be used by both the learning and classifying algorithms, so you should not have to replicate this code in both programs. In designing the preprocessor, you must choose which features to use. In your report, describe and justify your method for choosing features. You may use any interface you like, but be sure to describe this interface in your README.
2. Implement the naive Bayes Learner.
You must write a program that implements the naive Bayes learning algorithm. This program should take as input the set of labeled training text emails (as opposed to feature vectors) and estimate the necessary statistics. It should then output these statistics to a file to be used by the naive Bayes classifier. Use whatever input and output interfaces you like. In the README header, specify the exact command needed to run your learner on the spam and legit training emails. For instance, if the command to run your learner is:
spamNBLearner legit spam > spam_params.out
please specify this in the README header.
3. Implement the naive Bayes Classifier.
You must write a program that implements the naive Bayes classifier algorithm. This program should take two arguments: 1) the file containing the parameters to your classifier, 2) the name of a file consisting of text emails strung together. The emails will be novel spam and legit emails. Using the statistics learned by the naive Bayes learner, it should output to stdout the estimated class of each input email: 1 for spam and 0 for legitimate. Each class should be on a separate line. If the name of your classifier executable is spamNBClassifier, then the command to run your classifier on the email messages in the file test would be:
spamNBClassifier spam_params.out test
Please specify the name of the classifier executable in the README header. The expected output should consist of one class label (0 or 1) per line.
4. Test your classifier.
You are to test your classifier on each email of this set of unlabeled test emails. You should store the results of your classifier in the file test_results.out. Part of your grade will be based on the classification error of your algorithm on this test set. You may not hand label this test set and use it in your learning algorithm. Also, you may not look through the test set to find 'good features'. Using the test set in any way to enhance the performance of your classifier and then reporting these enhanced results is an ethical problem, because it violates the nature of the Scientific Method.
5. Write the Report.
Your report should address the issues listed in the Assignment 3 handout.
README Header:
Your README file should have the following header:
Partner 1 Name: [full name of one partner]
Partner 2 Name: [full name of other partner]
Partner 1 Login: [login of one partner]
Partner 2 Login: [login of other partner]
NB Learner Command: [full command to run your NB learner]
NB Classifier Executable Name: [name of the command to run your classifier]
Here is an example README header:
Partner 1 Name: Douglas Turnbull
Partner 2 Name: Kristin Branson
Partner 1 Login: dturnbul
Partner 2 Login: kbranson
NB Learner Command: spamNBLearner legit spam > spam_params.out
NB Classifier Executable Name: spamNBClassifier spam_params.out test
Your programs should compile with the command make.
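The preprocessor, learner, and classifier described in parts 1-3 can be sketched as follows. This is only a minimal illustration in Python, assuming binary bag-of-words features and add-one (Laplace) smoothing; the assignment mandates neither choice. The function names (extract_features, train_nb, classify_nb) and the toy emails are hypothetical, and the file and command-line interfaces are omitted.

```python
import math
import re
from collections import Counter


def extract_features(email_text):
    """Preprocessor: map raw email text to the set of lowercase tokens it contains.

    Binary bag-of-words is an assumed feature choice, not a requirement.
    """
    return set(re.findall(r"[a-z0-9']+", email_text.lower()))


def train_nb(legit_emails, spam_emails):
    """Learner: estimate class priors and P(word present | class).

    Assumes both lists are non-empty. Add-one smoothing keeps every
    probability strictly between 0 and 1.
    """
    docs = {0: legit_emails, 1: spam_emails}
    total = len(legit_emails) + len(spam_emails)
    doc_counts = {c: Counter() for c in docs}
    for c, emails in docs.items():
        for text in emails:
            # Each word is counted at most once per email.
            doc_counts[c].update(extract_features(text))
    vocab = set(doc_counts[0]) | set(doc_counts[1])
    params = {"prior": {}, "p_word": {}}
    for c, emails in docs.items():
        params["prior"][c] = len(emails) / total
        params["p_word"][c] = {
            w: (doc_counts[c][w] + 1) / (len(emails) + 2) for w in vocab
        }
    return params


def classify_nb(params, email_text):
    """Classifier: return 1 (spam) or 0 (legit) by comparing log posteriors."""
    present = extract_features(email_text)
    scores = {}
    for c in (0, 1):
        score = math.log(params["prior"][c])
        for w, p in params["p_word"][c].items():
            score += math.log(p if w in present else 1.0 - p)
        scores[c] = score
    return 1 if scores[1] > scores[0] else 0


# Toy data, invented purely for illustration.
legit = ["Meeting moved to noon, see agenda attached",
         "Lunch tomorrow with the project team?"]
spam = ["Win free money now, click here",
        "Free prize! Click now to claim your money"]
params = train_nb(legit, spam)
for msg in ["claim your free money", "agenda for the team meeting"]:
    print(classify_nb(params, msg))
```

Because both the learner and the classifier call the same extract_features function, the feature code is written only once, as part 1 requires. Summing log probabilities rather than multiplying raw probabilities avoids numerical underflow on long emails.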
Here is the grading guide for assignment 3:
This project will be graded as follows:
* 1/3 for your report: /50
* 1/3 for your experimental results: /50
* 1/3 for your code: /50
* Total: /150
* Lateness Penalty: %
* Final Score: /150
---------------------------------------------------------------------
1. Completeness and quality of report [50 pts]:
- [10] Introduction (Data Mining, SPAM Filtering Problem):
- [10] Naive Bayes Classifier Description:
  - [5] Description (how/why it works):
  - [5] Evaluation (Strength + Weakness of NB):
- [10] Features:
  - [5] Features Description (bag of words, others):
  - [5] Justification (predictive, alternatives):
- [10] Conclusion (Lessons Learned):
- [10] Overall Quality of Report - Spelling, Format, Comprehensibility:
---------------------------------------------------------------------
2. Correctness and completeness of your experimental results [50 pts]:
- [20] Reported Expected Performance (Cross-Validation Results):
- [30] Performance on Test Set:
  - 1200 is perfect, ~600 is no better than random guessing
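The expected performance in the first item can be estimated from the labeled training emails alone by k-fold cross-validation, without touching the test set. The sketch below is only an illustration in Python: the function name cross_validation_error and the train and classify callables are hypothetical placeholders for your own learner and classifier, and k=10 is an assumed default, not a requirement.

```python
import random


def cross_validation_error(examples, train, classify, k=10, seed=0):
    """Estimate the classification error rate by k-fold cross-validation.

    examples: list of (email_text, label) pairs with label 1 (spam) or 0 (legit)
    train:    callable taking a list of such pairs and returning a model
    classify: callable taking (model, email_text) and returning a label
    """
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled examples into k folds of nearly equal size.
    folds = [shuffled[i::k] for i in range(k)]
    errors = 0
    for i, held_out in enumerate(folds):
        # Train on every fold except the held-out one.
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(training)
        errors += sum(1 for text, label in held_out
                      if classify(model, text) != label)
    return errors / len(examples)
```

Each email is held out exactly once, so the returned value is the fraction of training emails misclassified by a model that never saw them during training.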
--------------------------------------------------------------------
3. Correctness and quality of your code [50 pts]:
- [3] README :
- [3] Makefile or single computation :
- [3] Programs compile without errors :
- [8] Preprocessor :
- [10] Naive Bayes Learner :
- [10] Naive Bayes Classifier :
- [10] Performance on Small Test Set :
- [3] Execution Speed and User Interface:
-----------------------------------------------------------------
Additional Comments:
Please contact Douglas Turnbull (dturnbul@cs.ucsd.edu) if you have any
comments or concerns.