Data is available at http://cseweb.ucsd.edu/~jmcauley/pml/data/. Download the files and save them to a local directory.

Amazon musical instrument review data. Originally from https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt

The dataset contains the following fields:

Parse the data and convert fields to integers where needed
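As a minimal sketch of this parsing step: the snippet below reads tab-separated rows into dictionaries and casts the numeric fields to integers. The in-memory sample, the `parse_reviews` helper, and the choice of integer fields (`star_rating`, `helpful_votes`, `total_votes`) are assumptions for illustration; the real file has more columns and would be opened from wherever it was saved.

```python
import csv
import io

# Tiny in-memory stand-in for the downloaded TSV file (hypothetical rows).
sample_tsv = (
    "customer_id\tproduct_id\tproduct_title\tstar_rating\thelpful_votes\ttotal_votes\n"
    "u1\ti1\tGuitar Strings\t5\t0\t1\n"
    "u2\ti1\tGuitar Strings\t3\t2\t2\n"
)

# Fields assumed to hold integer values.
INT_FIELDS = ("star_rating", "helpful_votes", "total_votes")

def parse_reviews(handle):
    """Read tab-separated rows into dictionaries, casting numeric fields to int."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    dataset = []
    for row in reader:
        for field in INT_FIELDS:
            row[field] = int(row[field])
        dataset.append(row)
    return dataset

dataset = parse_reviews(io.StringIO(sample_tsv))
print(dataset[0])
```

Disabling quote handling is a deliberate choice here, since free-text review fields may contain stray quote characters.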

One row of the dataset (as a Python dictionary)

Extract a few utility data structures
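A sketch of what these structures might look like, using a few hypothetical rows. The names `usersPerItem` and `ratingDict` appear later in this notebook; `itemsPerUser` and `itemNames` are assumed names for the complementary lookups.

```python
from collections import defaultdict

# Hypothetical parsed rows (only the fields needed here).
dataset = [
    {"customer_id": "u1", "product_id": "i1", "product_title": "Guitar Strings", "star_rating": 5},
    {"customer_id": "u2", "product_id": "i1", "product_title": "Guitar Strings", "star_rating": 3},
    {"customer_id": "u1", "product_id": "i2", "product_title": "Capo", "star_rating": 4},
]

usersPerItem = defaultdict(set)  # item -> set of users who reviewed it
itemsPerUser = defaultdict(set)  # user -> set of items they reviewed
itemNames = {}                   # item -> product title
ratingDict = {}                  # (user, item) -> star rating

for d in dataset:
    user, item = d["customer_id"], d["product_id"]
    usersPerItem[item].add(user)
    itemsPerUser[user].add(item)
    itemNames[item] = d["product_title"]
    ratingDict[(user, item)] = d["star_rating"]
```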

Extract per-user and per-item averages (useful later for rating prediction)
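One way to compute these averages, sketched on hypothetical `(user, item, rating)` triples. The names `ratingsPerUser`, `ratingsPerItem`, `userAverages`, `itemAverages`, and `ratingMean` are assumptions.

```python
from collections import defaultdict

# Hypothetical (user, item, rating) triples.
ratings = [("u1", "i1", 5), ("u1", "i2", 3), ("u2", "i1", 4)]

ratingsPerUser = defaultdict(list)
ratingsPerItem = defaultdict(list)
for user, item, rating in ratings:
    ratingsPerUser[user].append(rating)
    ratingsPerItem[item].append(rating)

# Mean rating given by each user and received by each item.
userAverages = {u: sum(rs) / len(rs) for u, rs in ratingsPerUser.items()}
itemAverages = {i: sum(rs) / len(rs) for i, rs in ratingsPerItem.items()}

# Global mean, handy as a fallback prediction.
ratingMean = sum(r for _, _, r in ratings) / len(ratings)
```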

Similarity metrics

Jaccard
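The Jaccard similarity of two sets is the size of their intersection divided by the size of their union. A minimal sketch (the function name `jaccard` and the zero-union convention are assumptions):

```python
def jaccard(s1, s2):
    """Jaccard similarity |s1 & s2| / |s1 | s2| between two sets."""
    numer = len(s1 & s2)
    denom = len(s1 | s2)
    # Two empty sets have an empty union; treat their similarity as 0.
    return numer / denom if denom > 0 else 0.0

# Two items that share two of four distinct reviewers.
sim = jaccard({"u1", "u2", "u3"}, {"u2", "u3", "u4"})
```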

Cosine

Simple implementation for set-structured data
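For sets (equivalently, binary interaction vectors), the cosine similarity reduces to the intersection size divided by the square root of the product of the set sizes. A sketch under that formulation, with `cosine_sets` as an assumed name:

```python
import math

def cosine_sets(s1, s2):
    """Cosine similarity between two sets viewed as binary vectors."""
    numer = len(s1 & s2)
    denom = math.sqrt(len(s1) * len(s2))
    # If either set is empty the vectors have zero norm; return 0.
    return numer / denom if denom > 0 else 0.0

sim = cosine_sets({"u1", "u2"}, {"u2", "u3"})
```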

Or for real values (e.g. ratings). Note that this implementation uses global variables (usersPerItem, ratingDict), which ideally should be passed as parameters.
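A sketch of the rating-based variant with the data structures passed as parameters rather than read from globals. The function name `cosine_ratings` and the toy data are assumptions; the numerator sums over users who rated both items, while each norm runs over all of that item's ratings.

```python
import math

def cosine_ratings(i1, i2, usersPerItem, ratingDict):
    """Cosine similarity between two items' rating vectors."""
    shared = usersPerItem[i1] & usersPerItem[i2]
    # Dot product only involves users who rated both items.
    numer = sum(ratingDict[(u, i1)] * ratingDict[(u, i2)] for u in shared)
    # Norms are taken over every rating each item received.
    norm1 = math.sqrt(sum(ratingDict[(u, i1)] ** 2 for u in usersPerItem[i1]))
    norm2 = math.sqrt(sum(ratingDict[(u, i2)] ** 2 for u in usersPerItem[i2]))
    denom = norm1 * norm2
    return numer / denom if denom > 0 else 0.0

# Hypothetical toy data.
usersPerItem = {"i1": {"u1", "u2"}, "i2": {"u1", "u3"}}
ratingDict = {("u1", "i1"): 4, ("u2", "i1"): 3, ("u1", "i2"): 4, ("u3", "i2"): 3}

sim = cosine_ratings("i1", "i2", usersPerItem, ratingDict)
```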

Pearson
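The Pearson similarity subtracts each item's average rating before comparing. The sketch below is one possible formulation, assuming sums are restricted to users who rated both items and that averages are taken over all of an item's ratings; the name `pearson` and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def pearson(i1, i2, usersPerItem, ratingDict, itemAverages):
    """Pearson similarity between two items over their shared users."""
    shared = usersPerItem[i1] & usersPerItem[i2]
    numer = 0.0
    sq1 = 0.0
    sq2 = 0.0
    for u in shared:
        d1 = ratingDict[(u, i1)] - itemAverages[i1]
        d2 = ratingDict[(u, i2)] - itemAverages[i2]
        numer += d1 * d2
        sq1 += d1 ** 2
        sq2 += d2 ** 2
    denom = math.sqrt(sq1) * math.sqrt(sq2)
    return numer / denom if denom > 0 else 0.0

# Hypothetical toy data: i2 tracks i1, i3 moves in the opposite direction.
ratingDict = {
    ("u1", "i1"): 5, ("u2", "i1"): 3,
    ("u1", "i2"): 4, ("u2", "i2"): 2,
    ("u1", "i3"): 2, ("u2", "i3"): 4,
}
usersPerItem = defaultdict(set)
ratingsPerItem = defaultdict(list)
for (u, i), r in ratingDict.items():
    usersPerItem[i].add(u)
    ratingsPerItem[i].append(r)
itemAverages = {i: sum(rs) / len(rs) for i, rs in ratingsPerItem.items()}

sim_same = pearson("i1", "i2", usersPerItem, ratingDict, itemAverages)
sim_opposite = pearson("i1", "i3", usersPerItem, ratingDict, itemAverages)
```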

Retrieve the most similar items to a given query

In this case, based on the Jaccard similarity
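A sketch of the retrieval step: score every other item against the query by Jaccard similarity and keep the top `n`. The function name `mostSimilar`, its signature, and the toy `usersPerItem` are assumptions for illustration.

```python
def jaccard(s1, s2):
    """Jaccard similarity between two sets."""
    denom = len(s1 | s2)
    return len(s1 & s2) / denom if denom > 0 else 0.0

def mostSimilar(query, usersPerItem, n):
    """Return the n (similarity, item) pairs most similar to the query item."""
    users = usersPerItem[query]
    similarities = []
    for other in usersPerItem:
        if other == query:
            continue
        similarities.append((jaccard(users, usersPerItem[other]), other))
    # Highest similarity first.
    similarities.sort(reverse=True)
    return similarities[:n]

# Hypothetical toy data.
usersPerItem = {
    "i1": {"u1", "u2", "u3"},
    "i2": {"u2", "u3"},
    "i3": {"u3", "u4", "u5"},
    "i4": {"u6"},
}

top = mostSimilar("i1", usersPerItem, 2)
```

This version compares the query against every item, so its cost grows with the size of the catalogue.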

Choose an item to use as a query

Retrieve the most similar items

Print names of query and recommended items

Faster implementation
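One way to speed this up, sketched below, is to score only candidate items that share at least one user with the query, since every other item has Jaccard similarity zero. The signature of `mostSimilarFast` and the toy data are assumptions.

```python
from collections import defaultdict

def jaccard(s1, s2):
    """Jaccard similarity between two sets."""
    denom = len(s1 | s2)
    return len(s1 & s2) / denom if denom > 0 else 0.0

def mostSimilarFast(query, usersPerItem, itemsPerUser, n):
    """Top-n similar items, scoring only items that share a user with the query."""
    users = usersPerItem[query]
    candidates = set()
    for u in users:
        candidates |= itemsPerUser[u]
    candidates.discard(query)
    similarities = [(jaccard(users, usersPerItem[c]), c) for c in candidates]
    similarities.sort(reverse=True)
    return similarities[:n]

# Hypothetical toy data.
usersPerItem = {
    "i1": {"u1", "u2", "u3"},
    "i2": {"u2", "u3"},
    "i3": {"u3", "u4", "u5"},
    "i4": {"u6"},
}
itemsPerUser = defaultdict(set)
for item, users in usersPerItem.items():
    for u in users:
        itemsPerUser[u].add(item)

top = mostSimilarFast("i1", usersPerItem, itemsPerUser, 2)
```

Items with zero similarity never enter the candidate set, so the output can differ from an exhaustive scan only when fewer than `n` items overlap with the query.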

Confirm that results are the same...

Similarity-based rating estimation

Use our similarity functions to estimate ratings. Start by building a few utility data structures.

Rating prediction heuristic (several alternatives from Chapter 4 could be used)
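As one plausible form of such a heuristic (not necessarily the exact variant used here): predict the target item's average plus a similarity-weighted average of how far the user's other ratings deviate from those items' averages, falling back to the global mean when no similar items exist. The function `predictRating`, its parameters, and the toy data are assumptions.

```python
from collections import defaultdict

def jaccard(s1, s2):
    """Jaccard similarity between two sets."""
    denom = len(s1 | s2)
    return len(s1 & s2) / denom if denom > 0 else 0.0

def predictRating(user, item, itemsPerUser, usersPerItem, ratingDict, itemAverages, ratingMean):
    """Similarity-weighted rating estimate for a (user, item) pair."""
    deviations = []
    similarities = []
    for other in itemsPerUser.get(user, set()):
        if other == item:
            continue
        deviations.append(ratingDict[(user, other)] - itemAverages[other])
        similarities.append(jaccard(usersPerItem[item], usersPerItem[other]))
    if sum(similarities) > 0:
        weighted = sum(d * s for d, s in zip(deviations, similarities))
        return itemAverages[item] + weighted / sum(similarities)
    # No overlapping items: fall back to the global mean.
    return ratingMean

# Hypothetical toy data.
ratingDict = {
    ("u1", "i1"): 5, ("u2", "i1"): 3,
    ("u1", "i2"): 4, ("u2", "i2"): 2, ("u3", "i2"): 5,
    ("u3", "i3"): 1,
}
usersPerItem = defaultdict(set)
itemsPerUser = defaultdict(set)
ratingsPerItem = defaultdict(list)
for (u, i), r in ratingDict.items():
    usersPerItem[i].add(u)
    itemsPerUser[u].add(i)
    ratingsPerItem[i].append(r)
itemAverages = {i: sum(rs) / len(rs) for i, rs in ratingsPerItem.items()}
ratingMean = sum(ratingDict.values()) / len(ratingDict)

prediction = predictRating("u3", "i1", itemsPerUser, usersPerItem,
                           ratingDict, itemAverages, ratingMean)
```

Note that the estimate is not clipped to the rating scale, so it can fall outside the 1 to 5 range.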

Predict a rating for a particular user/item pair

Compute the MSE for a model based on this heuristic

Compared to a trivial predictor which always predicts the mean
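A sketch of the comparison: compute the mean squared error of a set of predictions and of a baseline that always predicts the mean label. The labels and `modelPredictions` below are hypothetical placeholders, not results from the dataset.

```python
def MSE(predictions, labels):
    """Mean squared error between predictions and true labels."""
    differences = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return sum(differences) / len(differences)

# Hypothetical labels and model outputs, for illustration only.
labels = [5, 3, 4]
modelPredictions = [4.5, 3.5, 4.0]

# Trivial baseline: always predict the mean of the labels.
ratingMean = sum(labels) / len(labels)
alwaysPredictMean = [ratingMean for _ in labels]

mseModel = MSE(modelPredictions, labels)
mseBaseline = MSE(alwaysPredictMean, labels)
```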

Get predictions for all instances (fairly slow!)

Exercises

4.1

(implementation is provided via the function mostSimilarFast above)

4.2

(using Amazon musical instruments data from examples above)

4.3

Results on this dataset aren't particularly interesting. A denser dataset (one in which many item pairs have non-zero similarity) would likely give more interesting results.

4.4

(following code and auxiliary data structures from the examples above)

Equation 4.20

Equation 4.21

Equation 4.22