The first recommender system we'll implement is a simple simaliry-based recommender that makes recommendations based on the Jaccard similarity between items.
We'll start with several standard imports:
import gzip
from collections import defaultdict
import scipy
import scipy.optimize
import numpy
import random
And load the data much as before, converting integer-valued fields as we go. Note that here we use a larger "Musical Instruments" dataset for the sake of demonstrating a more scalable system, but you could repeat this exercise with any dataset (in particular, you could use a smaller one if these exercises run slowly on your machine). Again, this dataset was collected from https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt and should be saved locally.
path = "/Users/johnn/Dropbox/datasets/amazon_reviews_us_Musical_Instruments_v1_00.tsv.gz"
f = gzip.open(path, 'rt', encoding="utf8")
header = f.readline()
header = header.strip().split('\t')
dataset = []
for line in f:
fields = line.strip().split('\t')
d = dict(zip(header, fields))
d['star_rating'] = int(d['star_rating'])
d['helpful_votes'] = int(d['helpful_votes'])
d['total_votes'] = int(d['total_votes'])
dataset.append(d)
Let's examine one of the entries in this dataset:
dataset[0]
First we'll build a few useful data structures, in this case just to maintain a collection of the items reviewed by each user, and the collection of users who have reviewed each item.
usersPerItem = defaultdict(set)
itemsPerUser = defaultdict(set)
itemNames = {}
for d in dataset:
user,item = d['customer_id'], d['product_id']
usersPerItem[item].add(user)
itemsPerUser[user].add(item)
itemNames[item] = d['product_title']
This is a generic implementation of the Jaccard similarity between two sets, which we'll use to build our recommender:
def Jaccard(s1, s2):
numer = len(s1.intersection(s2))
denom = len(s1.union(s2))
return numer / denom
Our implementation of the recommender system just finds the most similar item (i2) compared to the query item (i), based on their Jaccard similarities (i.e., overlap between users who purchased both items).
def mostSimilar(i):
similarities = []
users = usersPerItem[i]
for i2 in usersPerItem:
if i2 == i: continue
sim = Jaccard(users, usersPerItem[i2])
similarities.append((sim,i2))
similarities.sort(reverse=True)
return similarities[:10]
Let's select some example item from the dataset to use as a query to generate similar recommendations:
dataset[2]
query = dataset[2]['product_id']
Next we'll examine the most similar items compared to this query:
mostSimilar(query)
itemNames[query]
[itemNames[x[1]] for x in mostSimilar(query)]
The above recommender generated reasonably effective recommendations, but was inefficient as it required iteration over all items. We can make this implementation more efficient by considering a smaller candidate set of items to be iterated over, namely we can restrict the search to only those items that could possibly have non-zero Jaccard similarity.
Specifically, this search is based on the set of items that were purchased by any user who purchased the query item i.
def mostSimilarFast(i):
similarities = []
users = usersPerItem[i]
candidateItems = set()
for u in users:
candidateItems = candidateItems.union(itemsPerUser[u])
for i2 in candidateItems:
if i2 == i: continue
sim = Jaccard(users, usersPerItem[i2])
similarities.append((sim,i2))
similarities.sort(reverse=True)
return similarities[:10]
mostSimilarFast(query)
We can also use the similarity-based recommender we developed above to make predictions about user's ratings. Although this is not an example of machine learning, it is a simple heuristic that can be used to estimate a user's future ratings based on their ratings in the past.
Specifically, a user's rating for an item is assumed to be a weighted sum of their previous ratings, weighted by how similar the query item is to each of their previous purchases.
We start by building a few more utility data structures to keep track of all of the reviews by each user and for each item.
reviewsPerUser = defaultdict(list)
reviewsPerItem = defaultdict(list)
for d in dataset:
user,item = d['customer_id'], d['product_id']
reviewsPerUser[user].append(d)
reviewsPerItem[item].append(d)
Next we compute the rating mean. This will be used as a simple baseline, but will also be used as a "default" prediction in the event that the user has rated no previous items with a Jaccard similarity greater than zero (compared to the query).
ratingMean = sum([d['star_rating'] for d in dataset]) / len(dataset)
ratingMean
Our prediction function computes (a) a list of the user's previous ratings (ignoring the query item); and (b) a list of the similarities of these previous items, compared to the query. These weights are used to constructed a weighted average of the ratings from the first set.
def predictRating(user,item):
ratings = []
similarities = []
for d in reviewsPerUser[user]:
i2 = d['product_id']
if i2 == item: continue
ratings.append(d['star_rating'])
similarities.append(Jaccard(usersPerItem[item],usersPerItem[i2]))
if (sum(similarities) > 0):
weightedRatings = [(x*y) for x,y in zip(ratings,similarities)]
return sum(weightedRatings) / sum(similarities)
else:
# User hasn't rated any similar items
return ratingMean
Let's try a simple example:
dataset[1]
u,i = dataset[1]['customer_id'], dataset[1]['product_id']
predictRating(u, i)
Again, we evaluate the performace of our model:
def MSE(predictions, labels):
differences = [(x-y)**2 for x,y in zip(predictions,labels)]
return sum(differences) / len(differences)
alwaysPredictMean = [ratingMean for d in dataset]
cfPredictions = [predictRating(d['customer_id'], d['product_id']) for d in dataset]
labels = [d['star_rating'] for d in dataset]
MSE(alwaysPredictMean, labels)
MSE(cfPredictions, labels)
In this case, the accuracy of our rating prediction model was actually worse (in terms of the MSE) than just predicting the mean rating. However note again that this is just a heuristic, and could be modified to improve its predictions (e.g. by using a different similarity function other than the Jaccard similarity).
Here we'll use gradient descent to implement a machine-learning-based recommender (a latent-factor model).
This is a fairly difficult exercise, but brings together many of the ideas we've seen previously in this class, especially regarding gradient descent.
First, we build some utility data structures to store the variables of our model (alpha, userBiases, and itemBiases)
N = len(dataset)
nUsers = len(reviewsPerUser)
nItems = len(reviewsPerItem)
users = list(reviewsPerUser.keys())
items = list(reviewsPerItem.keys())
alpha = ratingMean
userBiases = defaultdict(float)
itemBiases = defaultdict(float)
The actual prediction function of our model is simple: Just predict using a global offset (alpha), a user offset (beta_u in the slides), and an item offset (beta_i)
def prediction(user, item):
return alpha + userBiases[user] + itemBiases[item]
We'll use another library in this example to perform gradient descent. This library requires that we pass it a "flat" parameter vector (theta) containing all of our parameters. This utility function just converts between a flat feature vector, and our model parameters, i.e., it "unpacks" theta into our offset and bias parameters.
def unpack(theta):
global alpha
global userBiases
global itemBiases
alpha = theta[0]
userBiases = dict(zip(users, theta[1:nUsers+1]))
itemBiases = dict(zip(items, theta[1+nUsers:]))
The "cost" function is the function we are trying to optimize. Again this is a requirement of the gradient descent library we'll use. In this case, we're just computing the (regularized) MSE of a particular solution (theta), and returning the cost.
def cost(theta, labels, lamb):
unpack(theta)
predictions = [prediction(d['customer_id'], d['product_id']) for d in dataset]
cost = MSE(predictions, labels)
print("MSE = " + str(cost))
for u in userBiases:
cost += lamb*userBiases[u]**2
for i in itemBiases:
cost += lamb*itemBiases[i]**2
return cost
The derivative function is the most difficult to implement, but follows the definitions of the derivatives for this model as given in the lectures. This step could be avoided if using a gradient descent implementation based on (e.g.) Tensorflow.
def derivative(theta, labels, lamb):
unpack(theta)
N = len(dataset)
dalpha = 0
dUserBiases = defaultdict(float)
dItemBiases = defaultdict(float)
for d in dataset:
u,i = d['customer_id'], d['product_id']
pred = prediction(u, i)
diff = pred - d['star_rating']
dalpha += 2/N*diff
dUserBiases[u] += 2/N*diff
dItemBiases[i] += 2/N*diff
for u in userBiases:
dUserBiases[u] += 2*lamb*userBiases[u]
for i in itemBiases:
dItemBiases[i] += 2*lamb*itemBiases[i]
dtheta = [dalpha] + [dUserBiases[u] for u in users] + [dItemBiases[i] for i in items]
return numpy.array(dtheta)
Compute the MSE of a trivial baseline (always predicting the mean) for comparison:
MSE(alwaysPredictMean, labels)
Finally, we can run gradient descent. This particular gradient descent library takes as inputs (1) Our cost function (implemented above); (2) Initial parameter values; (3) Our derivative function; and (4) Any additional arguments to be passed to the cost function (in this case the labels and the regularization strength).
scipy.optimize.fmin_l_bfgs_b(cost, [alpha] + [0.0]*(nUsers+nItems),
derivative, args = (labels, 0.001))
This code extends the example above to implement a complete latent factor model (i.e., including low-dimensional user and item terms).
We first initialize our parameters as before:
alpha = ratingMean
userBiases = defaultdict(float)
itemBiases = defaultdict(float)
For each user and item we now have a low dimensional descriptor (representing that user's preferences, and that item's properties), of dimension K.
userGamma = {}
itemGamma = {}
K = 2
for u in reviewsPerUser:
userGamma[u] = [random.random() * 0.1 - 0.05 for k in range(K)]
for i in reviewsPerItem:
itemGamma[i] = [random.random() * 0.1 - 0.05 for k in range(K)]
Again we must implement an "unpack" function. This is the same as before, though has some additional terms.
def unpack(theta):
global alpha
global userBiases
global itemBiases
global userGamma
global itemGamma
index = 0
alpha = theta[index]
index += 1
userBiases = dict(zip(users, theta[index:index+nUsers]))
index += nUsers
itemBiases = dict(zip(items, theta[index:index+nItems]))
index += nItems
for u in users:
userGamma[u] = theta[index:index+K]
index += K
for i in items:
itemGamma[i] = theta[index:index+K]
index += K
Similarly, our cost and derivative functions serve the same role as before, though their implementations are somewhat more complicated.
def inner(x, y):
return sum([a*b for a,b in zip(x,y)])
def prediction(user, item):
return alpha + userBiases[user] + itemBiases[item] + inner(userGamma[user], itemGamma[item])
def cost(theta, labels, lamb):
unpack(theta)
predictions = [prediction(d['customer_id'], d['product_id']) for d in dataset]
cost = MSE(predictions, labels)
print("MSE = " + str(cost))
for u in users:
cost += lamb*userBiases[u]**2
for k in range(K):
cost += lamb*userGamma[u][k]**2
for i in items:
cost += lamb*itemBiases[i]**2
for k in range(K):
cost += lamb*itemGamma[i][k]**2
return cost
def derivative(theta, labels, lamb):
unpack(theta)
N = len(dataset)
dalpha = 0
dUserBiases = defaultdict(float)
dItemBiases = defaultdict(float)
dUserGamma = {}
dItemGamma = {}
for u in reviewsPerUser:
dUserGamma[u] = [0.0 for k in range(K)]
for i in reviewsPerItem:
dItemGamma[i] = [0.0 for k in range(K)]
for d in dataset:
u,i = d['customer_id'], d['product_id']
pred = prediction(u, i)
diff = pred - d['star_rating']
dalpha += 2/N*diff
dUserBiases[u] += 2/N*diff
dItemBiases[i] += 2/N*diff
for k in range(K):
dUserGamma[u][k] += 2/N*itemGamma[i][k]*diff
dItemGamma[i][k] += 2/N*userGamma[u][k]*diff
for u in userBiases:
dUserBiases[u] += 2*lamb*userBiases[u]
for k in range(K):
dUserGamma[u][k] += 2*lamb*userGamma[u][k]
for i in itemBiases:
dItemBiases[i] += 2*lamb*itemBiases[i]
for k in range(K):
dItemGamma[i][k] += 2*lamb*itemGamma[i][k]
dtheta = [dalpha] + [dUserBiases[u] for u in users] + [dItemBiases[i] for i in items]
for u in users:
dtheta += dUserGamma[u]
for i in items:
dtheta += dItemGamma[i]
return numpy.array(dtheta)
Again we optimize using our gradient descent library, and compare to a simple baseline.
MSE(alwaysPredictMean, labels)
scipy.optimize.fmin_l_bfgs_b(cost, [alpha] + # Initialize alpha
[0.0]*(nUsers+nItems) + # Initialize beta
[random.random() * 0.1 - 0.05 for k in range(K*(nUsers+nItems))], # Gamma
derivative, args = (labels, 0.001))
Note finally that in the above exercise we only computed the training error of our model, i.e., we never confirmed that it works well on held-out (validation/testing) data!