Before we study regression and classification diagnostics, we'll set up a new problem for testing and evaluating our methods. In this case, we'll use models that make predictions (e.g., estimating star ratings) based on the words in a review. This is a challenging problem due to the high dimensionality of the features, which can easily lead to overfitting.
First we import a few libraries. Most of these are the same as before, though a few new libraries are included for string processing.
import gzip
from collections import defaultdict
import string # Some string utilities
import random
from nltk.stem.porter import PorterStemmer # Stemming
import numpy
We'll base this example on the Amazon Gift Card data, as used in Course 1.
path = "/home/jmcauley/datasets/mooc/amazon/amazon_reviews_us_Gift_Card_v1_00.tsv.gz"
f = gzip.open(path, 'rt', encoding="utf8")
header = f.readline()
header = header.strip().split('\t')
dataset = []
for line in f:
    fields = line.strip().split('\t')
    d = dict(zip(header, fields))
    d['star_rating'] = int(d['star_rating'])
    d['helpful_votes'] = int(d['helpful_votes'])
    d['total_votes'] = int(d['total_votes'])
    dataset.append(d)
First, let's count the number of unique words in the corpus:
wordCount = defaultdict(int)
for d in dataset:
    for w in d['review_body'].split():
        wordCount[w] += 1
print(len(wordCount))
This number of words is too many to deal with (i.e., it would result in a 97,289-dimensional feature vector if used naively). Next, let's try to reduce this number by removing punctuation and capitalization, so that two instances of the same word are counted as the same even if punctuated or capitalized differently.
wordCount = defaultdict(int)
punctuation = set(string.punctuation)
for d in dataset:
    r = ''.join([c for c in d['review_body'].lower() if not c in punctuation])
    for w in r.split():
        wordCount[w] += 1
print(len(wordCount))
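To see exactly what this normalization does, here's a quick illustration on a made-up string (note that removing punctuation fuses hyphenated words):
s = "Great Gift-Card!!"
''.join([c for c in s.lower() if not c in punctuation]) # 'great giftcard'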
We're still left with a large number of words, so perhaps we can reduce it further by treating different inflections of a word (e.g., "drinks" vs. "drinking") as instances of the same word, identified by their common word stem ("drink"). This process is called "stemming".
We perform stemming using a stemmer from the NLTK (Natural Language Toolkit) library.
wordCount = defaultdict(int)
punctuation = set(string.punctuation)
stemmer = PorterStemmer()
for d in dataset:
    r = ''.join([c for c in d['review_body'].lower() if not c in punctuation])
    for w in r.split():
        w = stemmer.stem(w) # with stemming
        wordCount[w] += 1
print(len(wordCount))
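As a quick sanity check, we can apply the stemmer to a few inflections directly (the Porter stemmer is heuristic, so not every inflection collapses to the stem one might expect):
[stemmer.stem(w) for w in ['drink', 'drinks', 'drinking']] # all three map to 'drink'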
Even after all of the above operations, we are left with too many unique words to build a feature vector from directly. A simpler but effective approach is to take only the most common words and build features out of those words alone.
First we build a few data structures to count the number of instances of each word. Here we remove punctuation and capitalization, but do not apply stemming.
wordCount = defaultdict(int)
punctuation = set(string.punctuation)
for d in dataset:
    r = ''.join([c for c in d['review_body'].lower() if not c in punctuation])
    for w in r.split():
        wordCount[w] += 1
Having counted the number of instances of each word, we can sort them to find the most common, and build a word index based on these frequencies. For example, the most frequent word will have index 0, the second most frequent will have index 1, etc.
counts = [(wordCount[w], w) for w in wordCount]
counts.sort()
counts.reverse()
words = [x[1] for x in counts[:1000]]
wordId = dict(zip(words, range(len(words))))
wordSet = set(words)
Having selected our dictionary of common words, we can now build a feature vector by counting how often each of these words appears in each review. This is called a "bag-of-words" representation. It results in a 1,000-dimensional feature vector (1,001 if we include an offset term).
def feature(datum):
    feat = [0]*len(words)
    r = ''.join([c for c in datum['review_body'].lower() if not c in punctuation])
    for w in r.split():
        if w in wordSet:
            feat[wordId[w]] += 1
    feat.append(1) # offset
    return feat
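As a quick check, we can apply the function to a hypothetical review (we assume here, plausibly but without checking, that 'great' appears among the 1,000 most common words):
toyDatum = {'review_body': "Great gift card, great price!"}
fv = feature(toyDatum)
fv[wordId['great']] if 'great' in wordId else None # 2, if 'great' made the dictionary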
Having constructed this dataset, we perform training exactly as in previous examples:
random.shuffle(dataset)
X = [feature(d) for d in dataset]
y = [d['star_rating'] for d in dataset]
theta,residuals,rank,s = numpy.linalg.lstsq(X, y, rcond=None) # rcond=None silences a deprecation warning
Once the model has finished training, we can examine which are the most positive (or negative) words by looking at the largest (or smallest) corresponding values of theta.
To do so, let's sort our words based on their corresponding weights from theta:
wordWeights = list(zip(theta, words + ['offset']))
wordWeights.sort()
These are the 10 most negative words:
wordWeights[:10]
And the 9 most positive (the 10th is the offset term):
wordWeights[-10:]
Although the above model was effective, it was also very high dimensional, and thus may have been prone to overfitting. We can try to address this by adding a regularizer to our model.
The "Ridge" class from sklearn implements a least squares regression model (as in our example above) that includes a regularizer. The strength of the regularizer is controlled by the parameter alpha (equivalent to lambda in the course lectures).
from sklearn import linear_model
help(linear_model.Ridge)
Otherwise, fitting the ridge regression model is exactly the same as fitting a regular least squares model. Note the two extra parameters: the first is the regularization strength (alpha); the second indicates that we do not want the model to fit an intercept (since our feature vector already includes one).
model = linear_model.Ridge(1.0, fit_intercept=False)
model.fit(X, y)
Again, we can then examine the coefficients learned by our model.
theta = model.coef_
wordWeights = list(zip(theta, words + ['offset']))
wordWeights.sort()
wordWeights[:10]
wordWeights[-10:]
To evaluate our regressors and classifiers in more detail, below we will introduce several diagnostics for regression and classification.
First we discuss the MSE (which we have already been using), and its relationship to the R^2 statistic.
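For reference, writing $\hat{y}_i$ for the model's prediction on the $i$-th example, the three quantities are related by:
$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2, \qquad \mathrm{FVU} = \frac{\mathrm{MSE}}{\mathrm{Var}(y)}, \qquad R^2 = 1 - \mathrm{FVU}$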
We start by extracting the predictions from our model:
predictions = model.predict(X)
And computing the squared differences between predictions and labels:
differences = [(p-l)**2 for (p,l) in zip(predictions,y)]
The MSE is just the average (Mean) of these squared differences:
MSE = sum(differences) / len(differences)
print("MSE = " + str(MSE))
As we saw in the lectures, the R^2 statistic (and the FVU, or "Fraction of Variance Unexplained") normalizes the Mean Squared Error by the variance of the data:
FVU = MSE / numpy.var(y)
R2 = 1 - FVU
print("R2 = " + str(R2))
To look at some classification diagnostics, we first convert our problem into a classification setting. To do so we simply replace our output variable (the star rating) with a binary variable indicating whether the star rating was greater than 3.
We then follow the same procedure as before to generate predictions from our (logistic regression) classifier:
y_class = [(rating > 3) for rating in y]
model = linear_model.LogisticRegression()
model.fit(X, y_class)
predictions = model.predict(X)
correct = predictions == y_class
The first and simplest classifier evaluation metric is the accuracy:
accuracy = sum(correct) / len(correct)
print("Accuracy = " + str(accuracy))
To compute more detailed diagnostics, we first compute the number of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).
TP = sum([(p and l) for (p,l) in zip(predictions, y_class)])
FP = sum([(p and not l) for (p,l) in zip(predictions, y_class)])
TN = sum([(not p and not l) for (p,l) in zip(predictions, y_class)])
FN = sum([(not p and l) for (p,l) in zip(predictions, y_class)])
print("TP = " + str(TP))
print("FP = " + str(FP))
print("TN = " + str(TN))
print("FN = " + str(FN))
From these we can re-compute the accuracy:
(TP + TN) / (TP + FP + TN + FN)
As well as the true positive rate (TPR) and true negative rate (TNR):
TPR = TP / (TP + FN)
TNR = TN / (TN + FP)
Finally, we can compute the Balanced Error Rate (BER), which averages the error rates on the positive and negative classes, and is therefore robust to class imbalance.
BER = 1 - 1/2 * (TPR + TNR)
print("Balanced error rate = " + str(BER))
Next we can compute precision, recall, and the F1 score; later we'll extend these to ranking-based variants (precision@K and recall@K).
Precision and recall can be defined in terms of the number of true positives, false positives, and false negatives:
precision = TP / (TP + FP)
recall = TP / (TP + FN)
precision, recall
The F1 score is the harmonic mean of precision and recall. This is useful since it's easy to achieve either good precision or good recall in isolation, but hard for both values to be high simultaneously.
F1 = 2 * (precision*recall) / (precision + recall)
F1
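To see why the harmonic mean is appropriate here, compare it to the arithmetic mean for a degenerate classifier with perfect precision but very low recall (toy numbers, for illustration): the arithmetic mean still looks respectable, while the F1 score collapses.
p, r = 1.0, 0.01
(p + r) / 2 # arithmetic mean: 0.505
2 * (p*r) / (p + r) # harmonic mean (F1): roughly 0.02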
All of the models we've seen so far (regression, ridge regression, logistic regression, etc.) are capable of outputting confidence scores along with their predictions. We can use these scores to rank the model's output from most to least confident.
From the documentation, we see that decision_function will generate confidence scores for the model. Essentially, this function simply outputs the value of X·theta (plus the learned intercept).
help(model)
confidences = model.decision_function(X)
confidences
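We can verify this interpretation using the model's coef_ and intercept_ attributes (note that sklearn fits its own intercept by default, which is added to the score):
scoresManual = numpy.matmul(X, model.coef_[0]) + model.intercept_[0]
numpy.allclose(scoresManual, confidences) # should be True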
In particular, we are interested in whether the positive labels are assigned high confidence (i.e., positive instances should appear near the top of the ranking). To determine this, we sort the labels according to their confidence scores:
confidencesAndLabels = list(zip(confidences,y_class))
confidencesAndLabels
confidencesAndLabels.sort()
confidencesAndLabels.reverse()
confidencesAndLabels
Once we've obtained the sorted labels, we can discard the confidences: only the labels themselves matter in terms of our evaluation metrics.
labelsRankedByConfidence = [z[1] for z in confidencesAndLabels]
labelsRankedByConfidence
Precision@K measures what fraction of the top K entries in our ranked list actually had positive labels:
def precisionAtK(K, y_sorted):
    return sum(y_sorted[:K]) / K
While Recall@K measures what fraction of all positively labeled instances were returned among the top K:
def recallAtK(K, y_sorted):
    return sum(y_sorted[:K]) / sum(y_sorted)
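As a quick worked example (toy labels, not from our model), consider a ranked list containing three positive labels:
toyRanking = [True, True, False, True, False]
precisionAtK(3, toyRanking) # 2 of the top 3 entries are positive: 2/3
recallAtK(3, toyRanking) # 2 of the 3 positives appear in the top 3: 2/3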
precisionAtK(50, labelsRankedByConfidence)
precisionAtK(1000, labelsRankedByConfidence)
precisionAtK(10000, labelsRankedByConfidence)
recallAtK(50, labelsRankedByConfidence)
recallAtK(1000, labelsRankedByConfidence)
recallAtK(10000, labelsRankedByConfidence)
To combine our ideas about training, evaluation, regularization, and overfitting, we'll next implement a complete training, validation, and testing pipeline.
We start by importing libraries and reading in our data, just as before:
import gzip
from collections import defaultdict
import string
import random
path = "/home/jmcauley/datasets/mooc/amazon/amazon_reviews_us_Gift_Card_v1_00.tsv.gz"
f = gzip.open(path, 'rt', encoding="utf8")
header = f.readline()
header = header.strip().split('\t')
dataset = []
for line in f:
    fields = line.strip().split('\t')
    d = dict(zip(header, fields))
    d['star_rating'] = int(d['star_rating'])
    d['helpful_votes'] = int(d['helpful_votes'])
    d['total_votes'] = int(d['total_votes'])
    dataset.append(d)
And again we build word features based on the 1,000 most common words:
wordCount = defaultdict(int)
punctuation = set(string.punctuation)
for d in dataset:
    r = ''.join([c for c in d['review_body'].lower() if not c in punctuation])
    for w in r.split():
        wordCount[w] += 1
counts = [(wordCount[w], w) for w in wordCount]
counts.sort()
counts.reverse()
words = [x[1] for x in counts[:1000]]
wordId = dict(zip(words, range(len(words))))
wordSet = set(words)
def feature(datum):
    feat = [0]*len(words)
    r = ''.join([c for c in datum['review_body'].lower() if not c in punctuation])
    for w in r.split():
        if w in wordSet: # set membership is O(1), unlike a lookup in the list 'words'
            feat[wordId[w]] += 1
    feat.append(1) # offset
    return feat
Again we shuffle and split our data, this time into three parts: a training, a validation, and a test component.
random.shuffle(dataset)
X = [feature(d) for d in dataset]
y = [d['star_rating'] for d in dataset]
In this example we use half of our data (and labels) for training, the next quarter for validation, and the final quarter for testing.
N = len(X)
X_train = X[:N//2]
X_valid = X[N//2:3*N//4]
X_test = X[3*N//4:]
y_train = y[:N//2]
y_valid = y[N//2:3*N//4]
y_test = y[3*N//4:]
len(X), len(X_train), len(X_valid), len(X_test)
Again we'll train a model based on regularized (Ridge) regression.
from sklearn import linear_model
help(linear_model.Ridge)
Our MSE function is the same as before, but this time for convenience takes an (already trained) model as an input parameter.
def MSE(model, X, y):
    predictions = model.predict(X)
    differences = [(a-b)**2 for (a,b) in zip(predictions, y)]
    return sum(differences) / len(differences)
Finally, to implement the pipeline, we (1) iterate through several values of lambda; (2) fit a ridge regression model for each value; (3) evaluate that model's performance on the validation set; and (4) keep track of the best model we've seen so far (according to validation error).
bestModel = None
bestMSE = None
for lamb in [0.01, 0.1, 1, 10, 100]:
    model = linear_model.Ridge(lamb, fit_intercept=False)
    model.fit(X_train, y_train)
    mseTrain = MSE(model, X_train, y_train)
    mseValid = MSE(model, X_valid, y_valid)
    print("lambda = " + str(lamb) + ", training/validation error = " +
          str(mseTrain) + '/' + str(mseValid))
    if bestModel is None or mseValid < bestMSE:
        bestModel = model
        bestMSE = mseValid
Finally, we evaluate the performance of our best model on the test set. Note that this is the only time throughout the entire pipeline that we examine the test data.
mseTest = MSE(bestModel, X_test, y_test)
print("test error = " + str(mseTest))