Department of Computer Science and Engineering
University of California, San Diego
CSE 250B
Fall 2006
Project 3
DUE DATE CHANGED TO WEDNESDAY NOVEMBER 15, 2006, IN CLASS.
This project explores learning binary classifiers when the
dimensionality of the data is high compared to the number of
training examples.
We will use two types of data. The first is the original data for
the second project, i.e. the OCR (optical character recognition)
data published by Sanjoy Dasgupta at http://www.cs.ucsd.edu/users/dasgupta/250B/homeworks.html. See his http://www.cs.ucsd.edu/users/dasgupta/250B/hw0.pdf
for information about this data and how to manipulate it in Matlab.
Just use the data for the digits "1" and "7" to create a
difficult binary classification problem.
The second dataset consists of subsets of data from the Netflix $1M contest.
For a given movie, the task is to predict whether a customer will
like or dislike this movie. Our datasets for this task consist
of, for a given movie, all ratings provided by all customers who have
rated this particular movie. We currently have two such
datasets thanks to Thomas Rebotier. Each dataset has about
17,771 columns (i.e. movies) and many fewer rows (i.e. customers).
Most entries are zero, meaning that this customer has not rated
this movie. In Matlab format, the datasets are here and here.
Your main task is to design and implement a variant of the
perceptron algorithm that is successful for both these types of data.
You should start with the voted perceptron method, with an
averaged final classifier. Your implementation should use the
kernel trick so that you can run the algorithm with alternative
kernels. You should explore different kernels and possibly
different variants of the basic perceptron algorithm. Try to use
the theoretical results on the generalization error of the perceptron
to motivate your algorithm designs, and/or your data
re-representations.
As before, you must think rigorously about
how to do experiments that are conceptually sound. You may
use cross-validation or a fixed training set-test set split, but in
either case, be sure not to overfit your test data by using it too
often. Use a nearest-neighbor classifier as a baseline for
evaluating the accuracy you can achieve.
For this project, the only deliverable is a well-written report.
As always, "Discuss your results in precise and lucid
prose. Content is king, but looks matter too!"