Department of Computer Science and Engineering
University of California, San Diego
CSE 250B
Fall 2006

Project 4

UNOFFICIAL DUE DATE MONDAY DECEMBER 4, AT THE FINAL EXAM.


This project explores dimensionality reduction.  The data we will use is a subset of about 9% of the Netflix prize data.  This subset contains 9,637,551 ratings for 1777 movies (columns) from 249,308 users (rows).  Since this subset is still large, you may also use a smaller set of 267,790 ratings of 178 movies by 19,801 users.

Your goal is to find a representation of this matrix with few degrees of freedom that still gives an accurate reconstruction.  For example, you might use projections onto a small number of principal components.  The idea is that this reconstruction method could give useful predictions for ratings that are missing in the matrix.  The measure that you should use to measure the ultimate quality of a reconstruction is mean-squared error (MSE).  When you report an MSE number, you should be clear what the training set and test set are.  For example, you may use the whole matrix for training and then report MSE when reconstructing the whole matrix.  But in this case you do not have independent training and testing sets.  To achieve independence, you may for example randomly select 10% of the nonzero entries of the matrix, not use these for training, and then use them for testing.

Note that the data are sparse.  Most entries are zero, where zero means "missing."  You may handle the zero entries in any way you want during training; this is a major challenge!  When you measure MSE, you should use only test entries that are non-zero, i.e. 1, 2, 3, 4, or 5.  As a baseline for evaluating success, you can use the MSE achieved by predicting, for each movie, the average rating of that movie across all users

For this project, the only deliverable is a well-written report.  As always, "Discuss your results in precise and lucid prose.  Content is king, but looks matter too!"