Department of Computer Science and Engineering
University of California, San Diego
CSE 250B
Fall 2006

Project 3

DUE DATE CHANGED TO WEDNESDAY NOVEMBER 15, 2006, IN CLASS.


This project explores learning binary classifiers when the dimensionality of the data is high compared to the number of training examples.  

We will use two types of data.  The first is the original data for the second project, i.e. the OCR (optical character recognition) data published by Sanjoy Dasgupta at http://www.cs.ucsd.edu/users/dasgupta/250B/homeworks.html.  See his http://www.cs.ucsd.edu/users/dasgupta/250B/hw0.pdf for information about this data and how to manipulate it in Matlab.  Just use the data for the digits "1" and "7" to create a difficult binary classification problem.  

The second dataset consists of subsets of data from the Netflix $1M contest.  For a given movie, the task is to predict whether a customer will like or dislike this movie.  Our datasets for this task consist of, for a given movie, all ratings provided by all customers who have rated this particular movie.  We currently have two such datasets thanks to Thomas Rebotier.  Each dataset has about 17,771 columns (i.e. movies) and many fewer rows (i.e. customers).  Most entries are zero, meaning that this customer has not rated this movie.  In Matlab format, the datasets are here and here.

Your main task is to design and implement a variant of the perceptron algorithm that is successful for both these types of data.  You should start with the voted perceptron method, with an averaged final classifier.  Your implementation should use the kernel trick so that you can run the algorithm with alternative kernels.  You should explore different kernels and possibly different variants of the basic perceptron algorithm.  Try to use the theoretical results on the generalization error of the perceptron to motivate your algorithm designs, and/or your data re-representations.

As before, you must think rigorously about how to do experiments that are conceptually sound.  You may use cross-validation or a fixed training set-test set split, but in either case, be sure not to overfit your test data by using it too often.  Use a nearest-neighbor classifier as a baseline for evaluating the accuracy you can achieve.

For this project, the only deliverable is a well-written report.  As always, "Discuss your results in precise and lucid prose.  Content is king, but looks matter too!"