Similarity-based recommendation

The first recommender system we'll implement is a simple similarity-based recommender that makes recommendations based on the Jaccard similarity between items.

We'll start with several standard imports:

In [1]:
import gzip
from collections import defaultdict
import scipy
import scipy.optimize
import numpy
import random

And load the data much as before, converting integer-valued fields as we go. Note that here we use a larger "Musical Instruments" dataset for the sake of demonstrating a more scalable system, but you could repeat this exercise with any dataset (in particular, you could use a smaller one if these exercises run slowly on your machine). Again, this dataset was collected from https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt and should be saved locally.

In [3]:
path = "/Users/johnn/Dropbox/datasets/amazon_reviews_us_Musical_Instruments_v1_00.tsv.gz"
In [4]:
f = gzip.open(path, 'rt', encoding="utf8")
In [5]:
header = f.readline()
header = header.strip().split('\t')
In [6]:
dataset = []
In [7]:
for line in f:
    fields = line.strip().split('\t')
    d = dict(zip(header, fields))
    d['star_rating'] = int(d['star_rating'])
    d['helpful_votes'] = int(d['helpful_votes'])
    d['total_votes'] = int(d['total_votes'])
    dataset.append(d)

Let's examine one of the entries in this dataset:

In [8]:
dataset[0]
Out[8]:
{'marketplace': 'US',
 'customer_id': '45610553',
 'review_id': 'RMDCHWD0Y5OZ9',
 'product_id': 'B00HH62VB6',
 'product_parent': '618218723',
 'product_title': 'AGPtek® 10 Isolated Output 9V 12V 18V Guitar Pedal Board Power Supply Effect Pedals with Isolated Short Cricuit / Overcurrent Protection',
 'product_category': 'Musical Instruments',
 'star_rating': 3,
 'helpful_votes': 0,
 'total_votes': 1,
 'vine': 'N',
 'verified_purchase': 'N',
 'review_headline': 'Three Stars',
 'review_body': 'Works very good, but induces ALOT of noise.',
 'review_date': '2015-08-31'}

First we'll build a few useful data structures, in this case just to maintain a collection of the items reviewed by each user, and the collection of users who have reviewed each item.

In [9]:
usersPerItem = defaultdict(set)
itemsPerUser = defaultdict(set)
In [10]:
itemNames = {}
In [11]:
for d in dataset:
    user,item = d['customer_id'], d['product_id']
    usersPerItem[item].add(user)
    itemsPerUser[user].add(item)
    itemNames[item] = d['product_title']

This is a generic implementation of the Jaccard similarity between two sets, which we'll use to build our recommender:

In [12]:
def Jaccard(s1, s2):
    numer = len(s1.intersection(s2))
    denom = len(s1.union(s2))
    return numer / denom
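
For reference, this function implements the standard set formula

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

where A and B will be the sets of users who purchased the two items: the similarity is 1 if the two items were purchased by exactly the same users, and 0 if they share no users.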

Our implementation of the recommender system simply finds the items (i2) most similar to the query item (i), based on their Jaccard similarities (i.e., the overlap between the sets of users who purchased each item), and returns the ten most similar.

In [13]:
def mostSimilar(i):
    similarities = []
    users = usersPerItem[i]
    for i2 in usersPerItem:
        if i2 == i: continue
        sim = Jaccard(users, usersPerItem[i2])
        similarities.append((sim,i2))
    similarities.sort(reverse=True)
    return similarities[:10]

Generating a recommendation

Let's select an example item from the dataset to use as a query for recommending similar items:

In [14]:
dataset[2]
Out[14]:
{'marketplace': 'US',
 'customer_id': '6111003',
 'review_id': 'RIZR67JKUDBI0',
 'product_id': 'B0006VMBHI',
 'product_parent': '603261968',
 'product_title': 'AudioQuest LP record clean brush',
 'product_category': 'Musical Instruments',
 'star_rating': 3,
 'helpful_votes': 0,
 'total_votes': 1,
 'vine': 'N',
 'verified_purchase': 'Y',
 'review_headline': 'Three Stars',
 'review_body': 'removes dust. does not clean',
 'review_date': '2015-08-31'}
In [15]:
query = dataset[2]['product_id']

Next we'll examine the most similar items compared to this query:

In [16]:
mostSimilar(query)
Out[16]:
[(0.028446389496717725, 'B00006I5SD'),
 (0.01694915254237288, 'B00006I5SB'),
 (0.015065913370998116, 'B000AJR482'),
 (0.014204545454545454, 'B00E7MVP3S'),
 (0.008955223880597015, 'B001255YL2'),
 (0.008849557522123894, 'B003EIRVO8'),
 (0.008333333333333333, 'B0015VEZ22'),
 (0.00821917808219178, 'B00006I5UH'),
 (0.008021390374331552, 'B00008BWM7'),
 (0.007656967840735069, 'B000H2BC4E')]
In [17]:
itemNames[query]
Out[17]:
'AudioQuest LP record clean brush'
In [18]:
[itemNames[x[1]] for x in mostSimilar(query)]
Out[18]:
['Shure SFG-2 Stylus Tracking Force Gauge',
 'Shure M97xE High-Performance Magnetic Phono Cartridge',
 'ART Pro Audio DJPRE II Phono Turntable Preamplifier',
 'Signstek Blue LCD Backlight Digital Long-Playing LP Turntable Stylus Force Scale Gauge Tester',
 'Audio Technica AT120E/T Standard Mount Phono Cartridge',
 'Technics: 45 Adaptor for Technics 1200 (SFWE010)',
 'GruvGlide GRUVGLIDE DJ Package',
 'STANTON MAGNETICS Record Cleaner Kit',
 'Shure M97xE High-Performance Magnetic Phono Cartridge',
 'Behringer PP400 Ultra Compact Phono Preamplifier']

Efficient similarity-based recommendation

The above recommender generated reasonably effective recommendations, but was inefficient as it required iterating over all items. We can make this implementation more efficient by iterating over a smaller candidate set of items: namely, we can restrict the search to only those items that could possibly have a non-zero Jaccard similarity with the query.

Specifically, this search is based on the set of items that were purchased by any user who purchased the query item i.

In [19]:
def mostSimilarFast(i):
    similarities = []
    users = usersPerItem[i]
    candidateItems = set()
    # Candidate items: those purchased by at least one user who also purchased i
    for u in users:
        candidateItems = candidateItems.union(itemsPerUser[u])
    for i2 in candidateItems:
        if i2 == i: continue
        sim = Jaccard(users, usersPerItem[i2])
        similarities.append((sim,i2))
    similarities.sort(reverse=True)
    return similarities[:10]
In [20]:
mostSimilarFast(query)
Out[20]:
[(0.028446389496717725, 'B00006I5SD'),
 (0.01694915254237288, 'B00006I5SB'),
 (0.015065913370998116, 'B000AJR482'),
 (0.014204545454545454, 'B00E7MVP3S'),
 (0.008955223880597015, 'B001255YL2'),
 (0.008849557522123894, 'B003EIRVO8'),
 (0.008333333333333333, 'B0015VEZ22'),
 (0.00821917808219178, 'B00006I5UH'),
 (0.008021390374331552, 'B00008BWM7'),
 (0.007656967840735069, 'B000H2BC4E')]

Collaborative-filtering-based rating estimation

We can also use the similarity-based recommender we developed above to make predictions about users' ratings. Although this is not an example of machine learning, it is a simple heuristic that can be used to estimate a user's future ratings based on their ratings in the past.

Specifically, a user's rating for an item is estimated as a weighted average of their previous ratings, weighted by how similar the query item is to each of their previous purchases.
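
In equation form, writing Sim(i, j) for the Jaccard similarity between items i and j, and R(u, j) for user u's rating of item j, the prediction for user u and query item i is

rating(u, i) = [ Σ_j Sim(i, j) · R(u, j) ] / [ Σ_j Sim(i, j) ]

where both sums range over the items j (other than i) that user u has rated; if every similarity is zero we will fall back to the mean rating.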

We start by building a few more utility data structures to keep track of all of the reviews by each user and for each item.

In [21]:
reviewsPerUser = defaultdict(list)
reviewsPerItem = defaultdict(list)
In [22]:
for d in dataset:
    user,item = d['customer_id'], d['product_id']
    reviewsPerUser[user].append(d)
    reviewsPerItem[item].append(d)

Next we compute the rating mean. This will be used as a simple baseline, and will also serve as a "default" prediction in the event that the user has rated no previous items with a Jaccard similarity greater than zero (compared to the query).

In [23]:
ratingMean = sum([d['star_rating'] for d in dataset]) / len(dataset)
In [24]:
ratingMean
Out[24]:
4.251102772543146

Our prediction function computes (a) a list of the user's previous ratings (ignoring the query item); and (b) a list of the similarities of these previous items compared to the query. These similarities are used as weights to construct a weighted average of the ratings from the first list.

In [25]:
def predictRating(user,item):
    ratings = []
    similarities = []
    for d in reviewsPerUser[user]:
        i2 = d['product_id']
        if i2 == item: continue
        ratings.append(d['star_rating'])
        similarities.append(Jaccard(usersPerItem[item],usersPerItem[i2]))
    if (sum(similarities) > 0):
        weightedRatings = [(x*y) for x,y in zip(ratings,similarities)]
        return sum(weightedRatings) / sum(similarities)
    else:
        # User hasn't rated any similar items
        return ratingMean

Let's try a simple example:

In [26]:
dataset[1]
Out[26]:
{'marketplace': 'US',
 'customer_id': '14640079',
 'review_id': 'RZSL0BALIYUNU',
 'product_id': 'B003LRN53I',
 'product_parent': '986692292',
 'product_title': 'Sennheiser HD203 Closed-Back DJ Headphones',
 'product_category': 'Musical Instruments',
 'star_rating': 5,
 'helpful_votes': 0,
 'total_votes': 0,
 'vine': 'N',
 'verified_purchase': 'Y',
 'review_headline': 'Five Stars',
 'review_body': 'Nice headphones at a reasonable price.',
 'review_date': '2015-08-31'}
In [27]:
u,i = dataset[1]['customer_id'], dataset[1]['product_id']
In [28]:
predictRating(u, i)
Out[28]:
5.0

Again, we evaluate the performance of our model:

In [29]:
def MSE(predictions, labels):
    differences = [(x-y)**2 for x,y in zip(predictions,labels)]
    return sum(differences) / len(differences)
In [30]:
alwaysPredictMean = [ratingMean for d in dataset]
In [31]:
cfPredictions = [predictRating(d['customer_id'], d['product_id']) for d in dataset]
In [32]:
labels = [d['star_rating'] for d in dataset]
In [33]:
MSE(alwaysPredictMean, labels)
Out[33]:
1.4796142779564334
In [34]:
MSE(cfPredictions, labels)
Out[34]:
1.6146130004291603

In this case, the accuracy of our rating prediction model was actually worse (in terms of the MSE) than just predicting the mean rating. However, note again that this is just a heuristic, and it could be modified to improve its predictions, e.g. by using a similarity function other than the Jaccard similarity (a sketch of one such variant follows below).
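
As a sketch of one such modification (hypothetical, not part of the original exercise), we could replace the Jaccard similarity with the cosine similarity between the two items' sets of users; the rest of the prediction logic stays the same:

import math

def CosineSet(s1, s2):
    # Cosine similarity between two sets, treating them as binary vectors:
    # |intersection| / sqrt(|s1| * |s2|)
    numer = len(s1.intersection(s2))
    denom = math.sqrt(len(s1) * len(s2))
    if denom == 0:
        return 0
    return numer / denom

def predictRatingCosine(user, item):
    ratings = []
    similarities = []
    for d in reviewsPerUser[user]:
        i2 = d['product_id']
        if i2 == item: continue
        ratings.append(d['star_rating'])
        similarities.append(CosineSet(usersPerItem[item], usersPerItem[i2]))
    if sum(similarities) > 0:
        weightedRatings = [(x*y) for x,y in zip(ratings,similarities)]
        return sum(weightedRatings) / sum(similarities)
    else:
        return ratingMean

Whether this actually improves on the Jaccard-based version would need to be checked empirically, e.g. by recomputing the MSE as above.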

Simple (bias only) latent factor-based recommender

Here we'll use gradient descent to implement a machine-learning-based recommender (a latent-factor model).

This is a fairly difficult exercise, but brings together many of the ideas we've seen previously in this class, especially regarding gradient descent.

First, we build some utility data structures to store the variables of our model (alpha, userBiases, and itemBiases).

In [35]:
N = len(dataset)
nUsers = len(reviewsPerUser)
nItems = len(reviewsPerItem)
users = list(reviewsPerUser.keys())
items = list(reviewsPerItem.keys())
In [36]:
alpha = ratingMean
In [37]:
userBiases = defaultdict(float)
itemBiases = defaultdict(float)

The actual prediction function of our model is simple: we just predict using a global offset (alpha), a user offset (beta_u in the slides), and an item offset (beta_i).

In [38]:
def prediction(user, item):
    return alpha + userBiases[user] + itemBiases[item]

We'll use another library in this example to perform gradient descent. This library requires that we pass it a "flat" parameter vector (theta) containing all of our parameters. This utility function converts between that flat parameter vector and our model's parameters, i.e., it "unpacks" theta into our offset and bias terms.

In [39]:
def unpack(theta):
    global alpha
    global userBiases
    global itemBiases
    alpha = theta[0]
    userBiases = dict(zip(users, theta[1:nUsers+1]))
    itemBiases = dict(zip(items, theta[1+nUsers:]))

The "cost" function is the function we are trying to optimize. Again this is a requirement of the gradient descent library we'll use. In this case, we're just computing the (regularized) MSE of a particular solution (theta), and returning the cost.

In [40]:
def cost(theta, labels, lamb):
    unpack(theta)
    predictions = [prediction(d['customer_id'], d['product_id']) for d in dataset]
    cost = MSE(predictions, labels)
    print("MSE = " + str(cost))
    for u in userBiases:
        cost += lamb*userBiases[u]**2
    for i in itemBiases:
        cost += lamb*itemBiases[i]**2
    return cost
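
Written out, the quantity returned above is the regularized objective

obj(theta) = (1/N) Σ_{(u,i)} (prediction(u, i) − R(u, i))² + lambda · ( Σ_u beta_u² + Σ_i beta_i² )

where the first sum ranges over all reviews in the dataset and R(u, i) is the observed star rating.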

The derivative function is the most difficult to implement, but it follows the definitions of the derivatives for this model as given in the lectures (written out below). This step could be avoided by using a gradient descent implementation based on (e.g.) TensorFlow.
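
Written out (with f(u, i) = alpha + beta_u + beta_i, R(u, i) the observed rating, I_u the items rated by user u, and U_i the users who rated item i), the derivatives implemented below are

d obj / d alpha  = (2/N) Σ_{(u,i)} (f(u, i) − R(u, i))
d obj / d beta_u = (2/N) Σ_{i in I_u} (f(u, i) − R(u, i)) + 2 · lambda · beta_u
d obj / d beta_i = (2/N) Σ_{u in U_i} (f(u, i) − R(u, i)) + 2 · lambda · beta_i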

In [41]:
def derivative(theta, labels, lamb):
    unpack(theta)
    N = len(dataset)
    dalpha = 0
    dUserBiases = defaultdict(float)
    dItemBiases = defaultdict(float)
    for d in dataset:
        u,i = d['customer_id'], d['product_id']
        pred = prediction(u, i)
        diff = pred - d['star_rating']
        dalpha += 2/N*diff
        dUserBiases[u] += 2/N*diff
        dItemBiases[i] += 2/N*diff
    for u in userBiases:
        dUserBiases[u] += 2*lamb*userBiases[u]
    for i in itemBiases:
        dItemBiases[i] += 2*lamb*itemBiases[i]
    dtheta = [dalpha] + [dUserBiases[u] for u in users] + [dItemBiases[i] for i in items]
    return numpy.array(dtheta)

Compute the MSE of a trivial baseline (always predicting the mean) for comparison:

In [42]:
MSE(alwaysPredictMean, labels)
Out[42]:
1.4796142779564334

Finally, we can run gradient descent. This particular gradient descent library takes as inputs (1) our cost function (implemented above); (2) initial parameter values; (3) our derivative function; and (4) any additional arguments to be passed to the cost function (in this case, the labels and the regularization strength).

In [43]:
scipy.optimize.fmin_l_bfgs_b(cost, [alpha] + [0.0]*(nUsers+nItems),
                             derivative, args = (labels, 0.001))
MSE = 1.4796142779564334
MSE = 1.468686355953835
MSE = 2.6961687182003145
MSE = 1.468141901849422
MSE = 1.4523523347391138
MSE = 1.451357539727294
MSE = 1.4476987674765516
MSE = 1.4421925605951202
MSE = 1.441526267208805
MSE = 1.4413460037417507
MSE = 1.441397612244048
MSE = 1.4414066017099239
Out[43]:
(array([ 4.24278450e+00, -1.37216332e-03,  5.73953696e-03, ...,
         8.31051679e-04, -2.15966373e-03, -2.67061773e-04]),
 1.4574364057349312,
 {'grad': array([-4.52665785e-07,  8.64056712e-09, -2.76548829e-08, ...,
         -5.06227031e-09,  1.10178064e-08,  1.37544741e-09]),
  'task': b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL',
  'funcalls': 12,
  'nit': 9,
  'warnflag': 0})

Complete latent factor model

This code extends the example above to implement a complete latent factor model (i.e., including low-dimensional user and item terms).

We first initialize our parameters as before:

In [44]:
alpha = ratingMean
In [45]:
userBiases = defaultdict(float)
itemBiases = defaultdict(float)

For each user and item we now have a low-dimensional descriptor of dimension K, representing that user's preferences and that item's properties.
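
The prediction function then becomes

f(u, i) = alpha + beta_u + beta_i + gamma_u · gamma_i

where gamma_u · gamma_i is the inner product of the two K-dimensional vectors (implemented by the inner function below).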

In [46]:
userGamma = {}
itemGamma = {}
In [47]:
K = 2
In [48]:
for u in reviewsPerUser:
    userGamma[u] = [random.random() * 0.1 - 0.05 for k in range(K)]
In [49]:
for i in reviewsPerItem:
    itemGamma[i] = [random.random() * 0.1 - 0.05 for k in range(K)]

Again we must implement an "unpack" function. This is the same as before, though it has some additional terms.

In [50]:
def unpack(theta):
    global alpha
    global userBiases
    global itemBiases
    global userGamma
    global itemGamma
    index = 0
    alpha = theta[index]
    index += 1
    userBiases = dict(zip(users, theta[index:index+nUsers]))
    index += nUsers
    itemBiases = dict(zip(items, theta[index:index+nItems]))
    index += nItems
    for u in users:
        userGamma[u] = theta[index:index+K]
        index += K
    for i in items:
        itemGamma[i] = theta[index:index+K]
        index += K

Similarly, our cost and derivative functions serve the same role as before, though their implementations are somewhat more complicated.

In [51]:
def inner(x, y):
    return sum([a*b for a,b in zip(x,y)])
In [52]:
def prediction(user, item):
    return alpha + userBiases[user] + itemBiases[item] + inner(userGamma[user], itemGamma[item])
In [53]:
def cost(theta, labels, lamb):
    unpack(theta)
    predictions = [prediction(d['customer_id'], d['product_id']) for d in dataset]
    cost = MSE(predictions, labels)
    print("MSE = " + str(cost))
    for u in users:
        cost += lamb*userBiases[u]**2
        for k in range(K):
            cost += lamb*userGamma[u][k]**2
    for i in items:
        cost += lamb*itemBiases[i]**2
        for k in range(K):
            cost += lamb*itemGamma[i][k]**2
    return cost
In [54]:
def derivative(theta, labels, lamb):
    unpack(theta)
    N = len(dataset)
    dalpha = 0
    dUserBiases = defaultdict(float)
    dItemBiases = defaultdict(float)
    dUserGamma = {}
    dItemGamma = {}
    for u in reviewsPerUser:
        dUserGamma[u] = [0.0 for k in range(K)]
    for i in reviewsPerItem:
        dItemGamma[i] = [0.0 for k in range(K)]
    for d in dataset:
        u,i = d['customer_id'], d['product_id']
        pred = prediction(u, i)
        diff = pred - d['star_rating']
        dalpha += 2/N*diff
        dUserBiases[u] += 2/N*diff
        dItemBiases[i] += 2/N*diff
        for k in range(K):
            dUserGamma[u][k] += 2/N*itemGamma[i][k]*diff
            dItemGamma[i][k] += 2/N*userGamma[u][k]*diff
    for u in userBiases:
        dUserBiases[u] += 2*lamb*userBiases[u]
        for k in range(K):
            dUserGamma[u][k] += 2*lamb*userGamma[u][k]
    for i in itemBiases:
        dItemBiases[i] += 2*lamb*itemBiases[i]
        for k in range(K):
            dItemGamma[i][k] += 2*lamb*itemGamma[i][k]
    dtheta = [dalpha] + [dUserBiases[u] for u in users] + [dItemBiases[i] for i in items]
    for u in users:
        dtheta += dUserGamma[u]
    for i in items:
        dtheta += dItemGamma[i]
    return numpy.array(dtheta)

Again we optimize using our gradient descent library, and compare to a simple baseline.

In [55]:
MSE(alwaysPredictMean, labels)
Out[55]:
1.4796142779564334
In [56]:
scipy.optimize.fmin_l_bfgs_b(cost, [alpha] + # Initialize alpha
                                   [0.0]*(nUsers+nItems) + # Initialize beta
                                   [random.random() * 0.1 - 0.05 for k in range(K*(nUsers+nItems))], # Gamma
                             derivative, args = (labels, 0.001))
MSE = 1.4796102344273518
MSE = 1.4775873207802306
MSE = 1.4700799551149453
MSE = 218.84600512731896
MSE = 1.4760189423087369
MSE = 1.4428283988951685
MSE = 1.437013894343459
MSE = 1.438876397641038
MSE = 1.4401877034345696
MSE = 1.4413008607741815
MSE = 1.4413915141128386
MSE = 1.4413770903701413
MSE = 1.4413881749792885
Out[56]:
(array([ 4.24278458e+00, -1.37650512e-03,  5.75500830e-03, ...,
        -4.57904038e-06,  3.76994221e-07,  9.54504554e-08]),
 1.4574363606184177,
 {'grad': array([-4.92749323e-07, -4.82700300e-11,  3.52296374e-09, ...,
         -9.19779828e-09,  7.53786885e-10,  1.90847967e-10]),
  'task': b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL',
  'funcalls': 13,
  'nit': 10,
  'warnflag': 0})

Note finally that in the above exercise we only computed the training error of our model, i.e., we never confirmed that it works well on held-out (validation/testing) data!
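
As a minimal sketch of how such an evaluation might look for the similarity-based rating predictor (hypothetical code, assuming the definitions of Jaccard, MSE, and predictRating from above), we could hold out a fraction of the reviews, rebuild the utility data structures from the remaining (training) reviews only, and compare MSEs on the held-out portion:

# Hold out 10% of the reviews for testing (work on a copy so the original list is untouched)
shuffled = dataset[:]
random.shuffle(shuffled)
nTrain = int(len(shuffled) * 0.9)
trainSet, testSet = shuffled[:nTrain], shuffled[nTrain:]

# Rebuild the structures used by predictRating from the training reviews only
usersPerItem = defaultdict(set)
reviewsPerUser = defaultdict(list)
for d in trainSet:
    user, item = d['customer_id'], d['product_id']
    usersPerItem[item].add(user)
    reviewsPerUser[user].append(d)
ratingMean = sum([d['star_rating'] for d in trainSet]) / len(trainSet)

testLabels = [d['star_rating'] for d in testSet]
testPredictions = [predictRating(d['customer_id'], d['product_id']) for d in testSet]

MSE([ratingMean for d in testSet], testLabels)  # baseline on held-out reviews
MSE(testPredictions, testLabels)                # heuristic on held-out reviews

The same idea applies to the latent factor models, though there the parameters would also need to be fit using the training reviews only.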