Department of Computer Science and Engineering
University of California, San Diego
UNOFFICIAL DUE DATE MONDAY DECEMBER 4, AT THE FINAL EXAM.
This project explores dimensionality reduction. The data we will use is a subset of about 9% of the Netflix Prize
data. This subset contains 9,637,551 ratings for 1777 movies
(columns) from 249,308 users (rows). Since this subset is still
large, you may also use a smaller set of 267,790 ratings of 178 movies by 19,801 users.
Your goal is to find a representation of this matrix with few degrees
of freedom that still gives an accurate reconstruction. For
example, you might use projections onto a small number of principal
components. The idea is that this reconstruction method could
give useful predictions for ratings that are missing in the matrix.
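As one illustration of the principal-components idea, here is a minimal sketch in Python with NumPy, not a prescribed method. It assumes the ratings fit in a dense array with users as rows, movies as columns, and 0 meaning "missing." The function name low_rank_reconstruct and the choice to fill missing entries with per-movie means before a truncated SVD are assumptions made for this example; you are free to handle missing entries differently.

```python
import numpy as np

def low_rank_reconstruct(R, k):
    """Rank-k reconstruction of a ratings matrix R (users x movies, 0 = missing).

    Illustrative choice: missing entries are filled with the movie's mean
    observed rating, the columns are centered, and the centered matrix is
    approximated by its top-k singular components.
    """
    R = np.asarray(R, dtype=float)
    observed = R > 0
    counts = observed.sum(axis=0)
    global_mean = R[observed].mean()
    # Mean observed rating per movie; fall back to the global mean
    # for a movie that has no observed ratings at all.
    movie_mean = np.where(counts > 0,
                          R.sum(axis=0) / np.maximum(counts, 1),
                          global_mean)
    filled = np.where(observed, R, movie_mean)
    centered = filled - movie_mean
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    approx = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return approx + movie_mean

if __name__ == "__main__":
    R = np.array([[5, 3, 0],
                  [4, 0, 1],
                  [0, 2, 5],
                  [1, 1, 4]])
    print(low_rank_reconstruct(R, k=1))
```

With k components the representation stores k numbers per user and k per movie (plus the movie means), which is the sense in which it has few degrees of freedom. A dense SVD like this is only practical for the smaller data set; the larger one would need a sparse or iterative method.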
The metric that you should use to judge the ultimate quality
of a reconstruction is mean-squared error (MSE). When you report
an MSE number, you should state clearly what the training set and test set
are. For example, you may use the whole matrix for training and
then report MSE when reconstructing the whole matrix. But in this
case you do not have independent training and testing sets. To
achieve independence, you may, for example, randomly select 10% of the
nonzero entries of the matrix, withhold them from training, and then use
them for testing.
Note that the data are sparse. Most entries are zero, where zero
means "missing." You may handle the zero entries in any way you
want during training; this is a major challenge! When you measure
MSE, you should use only test entries that are nonzero, i.e., 1, 2, 3,
4, or 5. As a baseline for evaluating success, you can use the
MSE achieved by predicting, for each movie, the average rating of
that movie across all users.
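A minimal sketch of this baseline, assuming a dense training matrix in which 0 means "missing" and a list of held-out test entries given by their movie (column) indices and true ratings. The function name and the fallback to the global mean for a movie with no training ratings are assumptions of the example.

```python
import numpy as np

def movie_mean_baseline_mse(train, test_cols, test_ratings):
    """MSE from predicting each test rating by its movie's mean training rating."""
    train = np.asarray(train, dtype=float)
    observed = train > 0
    counts = observed.sum(axis=0)
    global_mean = train[observed].mean()
    # Average over observed (nonzero) ratings only; zeros are missing,
    # so they must not be counted in the denominator.
    movie_mean = np.where(counts > 0,
                          train.sum(axis=0) / np.maximum(counts, 1),
                          global_mean)
    predictions = movie_mean[np.asarray(test_cols)]
    errors = predictions - np.asarray(test_ratings, dtype=float)
    return float(np.mean(errors ** 2))

if __name__ == "__main__":
    train = np.array([[5, 0],
                      [3, 2],
                      [0, 4]])
    # Movie means are 4 and 3; test ratings 4 and 5 give errors 0 and -2.
    print(movie_mean_baseline_mse(train, test_cols=[0, 1], test_ratings=[4, 5]))
```

In the small example the squared errors are 0 and 4, so the baseline MSE is 2.0. A reconstruction method is only interesting if its test MSE is below the number this baseline gives on the same test entries.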
For this project, the only deliverable is a well-written report.
As always, "Discuss your results in precise and lucid
prose. Content is king, but looks matter too!"