Data setup - finding the most common words

Before we study regression and classification diagnostics, we'll set up a new problem for testing and evaluating our methods. In this case, we'll use models that make predictions (e.g., estimating star ratings) based on the words in a review. This is a challenging problem due to the dimensionality of the features, which can easily lead to overfitting.

First we import a few libraries. Most of these are the same as before, though a few new libraries are included for string processing.

In [1]:
import gzip
from collections import defaultdict
import string # Some string utilities
import random
from nltk.stem.porter import PorterStemmer # Stemming
import numpy

We'll base this example on the Amazon Gift Card data, as used in Course 1.

In [2]:
path = "/home/jmcauley/datasets/mooc/amazon/amazon_reviews_us_Gift_Card_v1_00.tsv.gz"
In [3]:
f = gzip.open(path, 'rt', encoding="utf8")
In [4]:
header = f.readline()
header = header.strip().split('\t')
In [5]:
dataset = []
In [6]:
for line in f:
    fields = line.strip().split('\t')
    d = dict(zip(header, fields))
    d['star_rating'] = int(d['star_rating'])
    d['helpful_votes'] = int(d['helpful_votes'])
    d['total_votes'] = int(d['total_votes'])
    dataset.append(d)

Counting words

First, let's count the number of unique words in the corpus.

In [7]:
wordCount = defaultdict(int)
for d in dataset:
    for w in d['review_body'].split():
        wordCount[w] += 1

print(len(wordCount))
97289

This is too many words to deal with (i.e., it would result in a 97,289-dimensional feature vector if used naively). Next, let's try to reduce this number by removing punctuation and capitalization, so that two instances of the same word are counted as the same even if they are punctuated or capitalized differently.
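For instance, this normalization maps differently punctuated or capitalized variants of a word to a single token. A quick illustration:

punctuation = set(string.punctuation)
''.join([c for c in "Great gift!".lower() if not c in punctuation]) # 'great gift'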

In [8]:
wordCount = defaultdict(int)
punctuation = set(string.punctuation)
for d in dataset:
    r = ''.join([c for c in d['review_body'].lower() if not c in punctuation])
    for w in r.split():
        wordCount[w] += 1

print(len(wordCount))
46283

We're still left with a large number of words, so perhaps we can reduce this number further by treating different inflections of a word (e.g., "drinks" vs. "drinking") as instances of the same word, identified by their common word stem (i.e., "drink"). This process is called "stemming".

We perform stemming using a stemmer from the NLTK (Natural Language Toolkit) library.
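For example, the stemmer maps several inflections to a common stem:

stemmer = PorterStemmer()
stemmer.stem('drinks')   # 'drink'
stemmer.stem('drinking') # 'drink'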

In [9]:
wordCount = defaultdict(int)
punctuation = set(string.punctuation)
stemmer = PorterStemmer()
for d in dataset:
    r = ''.join([c for c in d['review_body'].lower() if not c in punctuation])
    for w in r.split():
        w = stemmer.stem(w) # with stemming
        wordCount[w] += 1

print(len(wordCount))
37480

Extracting and building features from the most common words

Even after all of the above operations, we are left with too many unique words to build a feature vector from. A simple but effective approach is to take only the most common words and build features out of those words alone.

First we build a few data structures to count the number of instances of each word. Here we remove punctuation and capitalization, but do not apply stemming.

In [10]:
wordCount = defaultdict(int)
punctuation = set(string.punctuation)

for d in dataset:
    r = ''.join([c for c in d['review_body'].lower() if not c in punctuation])
    for w in r.split():
        wordCount[w] += 1

Having counted the number of instances of each word, we can sort them to find the most common, and build a word index based on these frequencies. For example, the most frequent word will have index 0, the second most frequent will have index 1, etc.

In [11]:
counts = [(wordCount[w], w) for w in wordCount]
counts.sort()
counts.reverse()

words = [x[1] for x in counts[:1000]]

wordId = dict(zip(words, range(len(words))))
wordSet = set(words)

Building our bag-of-words features

Having selected our dictionary of common words, we can now build a feature vector by counting how often each of these words appears in each review. This is called a "bag-of-words" feature representation. It results in a 1,000-dimensional feature vector (1,001 if we include an offset term).

In [12]:
def feature(datum):
    feat = [0]*len(words)
    r = ''.join([c for c in datum['review_body'].lower() if not c in punctuation])
    for w in r.split():
        if w in wordSet:
            feat[wordId[w]] += 1
    feat.append(1) #offset
    return feat
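As a quick sanity check, each feature vector has one entry per dictionary word plus the offset:

f0 = feature(dataset[0])
len(f0) # 1001: 1,000 word counts plus the offset term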

Having constructed this feature representation, we perform training exactly as in previous examples.

In [13]:
random.shuffle(dataset)
In [14]:
X = [feature(d) for d in dataset]
In [15]:
y = [d['star_rating'] for d in dataset]
In [16]:
theta,residuals,rank,s = numpy.linalg.lstsq(X, y, rcond=None)

Visualizing important words

Once the model has finished training, we can examine which words are the most positive (or negative) by looking at the largest (or smallest) corresponding values of theta.

To do so, let's sort our words based on their corresponding weights from theta:

In [17]:
wordWeights = list(zip(theta, words + ['offset']))
wordWeights.sort()

These are the 10 most negative words:

In [18]:
wordWeights[:10]
Out[18]:
[(-1.215456507271758, 'disappointing'),
 (-0.857472017273877, 'disappointed'),
 (-0.7905359349220556, 'unable'),
 (-0.6808380275904115, 'waste'),
 (-0.6634111839366602, 'charged'),
 (-0.5391972452016601, 'supposed'),
 (-0.5292354754787616, 'unfortunately'),
 (-0.4973963444619466, 'australia'),
 (-0.4964445645712919, 'tried'),
 (-0.47776774788212434, 'wont')]

And these are the 9 most positive (the 10th entry is the offset term):

In [19]:
wordWeights[-10:]
Out[19]:
[(0.23601901891939855, 'whats'),
 (0.2383326014985241, 'problems'),
 (0.24436555664649356, 'particular'),
 (0.24700474913779377, 'worry'),
 (0.25361200245640303, 'exelente'),
 (0.2597913183314595, 'excelent'),
 (0.27148055221479017, 'excelente'),
 (0.31038243174660535, 'beat'),
 (0.32688983177875636, 'expire'),
 (4.658200085737054, 'offset')]

Including a regularizer in our model

Although the above model was effective, it was also very high dimensional, and may therefore have been prone to overfitting. We can try to address this by adding a regularizer to our model.

The "Ridge" class from sklearn implements a least squares regression model (as in our example above) that includes a regularizer. The strength of the regularizer is controlled by the parameter alpha (equivalent to lambda in the course lectures).

In [20]:
from sklearn import linear_model
In [21]:
help(linear_model.Ridge)
Help on class Ridge in module sklearn.linear_model.ridge:

class Ridge(_BaseRidge, sklearn.base.RegressorMixin)
 |  Linear least squares with l2 regularization.
 |  
 |  This model solves a regression model where the loss function is
 |  the linear least squares function and regularization is given by
 |  the l2-norm. Also known as Ridge Regression or Tikhonov regularization.
 |  This estimator has built-in support for multi-variate regression
 |  (i.e., when y is a 2d-array of shape [n_samples, n_targets]).
 |  
 |  Read more in the :ref:`User Guide <ridge_regression>`.
 |  
 |  Parameters
 |  ----------
 |  alpha : {float, array-like}, shape (n_targets)
 |      Small positive values of alpha improve the conditioning of the problem
 |      and reduce the variance of the estimates.  Alpha corresponds to
 |      ``C^-1`` in other linear models such as LogisticRegression or
 |      LinearSVC. If an array is passed, penalties are assumed to be specific
 |      to the targets. Hence they must correspond in number.
 |  
 |  copy_X : boolean, optional, default True
 |      If True, X will be copied; else, it may be overwritten.
 |  
 |  fit_intercept : boolean
 |      Whether to calculate the intercept for this model. If set
 |      to false, no intercept will be used in calculations
 |      (e.g. data is expected to be already centered).
 |  
 |  max_iter : int, optional
 |      Maximum number of iterations for conjugate gradient solver.
 |      For 'sparse_cg' and 'lsqr' solvers, the default value is determined
 |      by scipy.sparse.linalg. For 'sag' solver, the default value is 1000.
 |  
 |  normalize : boolean, optional, default False
 |      If True, the regressors X will be normalized before regression.
 |  
 |  solver : {'auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag'}
 |      Solver to use in the computational routines:
 |  
 |      - 'auto' chooses the solver automatically based on the type of data.
 |  
 |      - 'svd' uses a Singular Value Decomposition of X to compute the Ridge
 |        coefficients. More stable for singular matrices than
 |        'cholesky'.
 |  
 |      - 'cholesky' uses the standard scipy.linalg.solve function to
 |        obtain a closed-form solution.
 |  
 |      - 'sparse_cg' uses the conjugate gradient solver as found in
 |        scipy.sparse.linalg.cg. As an iterative algorithm, this solver is
 |        more appropriate than 'cholesky' for large-scale data
 |        (possibility to set `tol` and `max_iter`).
 |  
 |      - 'lsqr' uses the dedicated regularized least-squares routine
 |        scipy.sparse.linalg.lsqr. It is the fatest but may not be available
 |        in old scipy versions. It also uses an iterative procedure.
 |  
 |      - 'sag' uses a Stochastic Average Gradient descent. It also uses an
 |        iterative procedure, and is often faster than other solvers when
 |        both n_samples and n_features are large. Note that 'sag' fast
 |        convergence is only guaranteed on features with approximately the
 |        same scale. You can preprocess the data with a scaler from
 |        sklearn.preprocessing.
 |  
 |      All last four solvers support both dense and sparse data. However,
 |      only 'sag' supports sparse input when `fit_intercept` is True.
 |  
 |      .. versionadded:: 0.17
 |         Stochastic Average Gradient descent solver.
 |  
 |  tol : float
 |      Precision of the solution.
 |  
 |  random_state : int seed, RandomState instance, or None (default)
 |      The seed of the pseudo random number generator to use when
 |      shuffling the data. Used in 'sag' solver.
 |  
 |      .. versionadded:: 0.17
 |         *random_state* to support Stochastic Average Gradient.
 |  
 |  Attributes
 |  ----------
 |  coef_ : array, shape (n_features,) or (n_targets, n_features)
 |      Weight vector(s).
 |  
 |  intercept_ : float | array, shape = (n_targets,)
 |      Independent term in decision function. Set to 0.0 if
 |      ``fit_intercept = False``.
 |  
 |  n_iter_ : array or None, shape (n_targets,)
 |      Actual number of iterations for each target. Available only for
 |      sag and lsqr solvers. Other solvers will return None.
 |  
 |  See also
 |  --------
 |  RidgeClassifier, RidgeCV, KernelRidge
 |  
 |  Examples
 |  --------
 |  >>> from sklearn.linear_model import Ridge
 |  >>> import numpy as np
 |  >>> n_samples, n_features = 10, 5
 |  >>> np.random.seed(0)
 |  >>> y = np.random.randn(n_samples)
 |  >>> X = np.random.randn(n_samples, n_features)
 |  >>> clf = Ridge(alpha=1.0)
 |  >>> clf.fit(X, y) # doctest: +NORMALIZE_WHITESPACE
 |  Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
 |        normalize=False, random_state=None, solver='auto', tol=0.001)
 |  
 |  Method resolution order:
 |      Ridge
 |      _BaseRidge
 |      abc.NewBase
 |      sklearn.linear_model.base.LinearModel
 |      abc.NewBase
 |      sklearn.base.BaseEstimator
 |      sklearn.base.RegressorMixin
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, alpha=1.0, fit_intercept=True, normalize=False, copy_X=True, max_iter=None, tol=0.001, solver='auto', random_state=None)
 |  
 |  fit(self, X, y, sample_weight=None)
 |      Fit Ridge regression model
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix}, shape = [n_samples, n_features]
 |          Training data
 |      
 |      y : array-like, shape = [n_samples] or [n_samples, n_targets]
 |          Target values
 |      
 |      sample_weight : float or numpy array of shape [n_samples]
 |          Individual weights for each sample
 |      
 |      Returns
 |      -------
 |      self : returns an instance of self.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset([])
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.linear_model.base.LinearModel:
 |  
 |  decision_function(*args, **kwargs)
 |      DEPRECATED:  and will be removed in 0.19.
 |      
 |      Decision function of the linear model.
 |      
 |              Parameters
 |              ----------
 |              X : {array-like, sparse matrix}, shape = (n_samples, n_features)
 |                  Samples.
 |      
 |              Returns
 |              -------
 |              C : array, shape = (n_samples,)
 |                  Returns predicted values.
 |  
 |  predict(self, X)
 |      Predict using the linear model
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix}, shape = (n_samples, n_features)
 |          Samples.
 |      
 |      Returns
 |      -------
 |      C : array, shape = (n_samples,)
 |          Returns predicted values.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.BaseEstimator:
 |  
 |  __repr__(self)
 |  
 |  get_params(self, deep=True)
 |      Get parameters for this estimator.
 |      
 |      Parameters
 |      ----------
 |      deep: boolean, optional
 |          If True, will return the parameters for this estimator and
 |          contained subobjects that are estimators.
 |      
 |      Returns
 |      -------
 |      params : mapping of string to any
 |          Parameter names mapped to their values.
 |  
 |  set_params(self, **params)
 |      Set the parameters of this estimator.
 |      
 |      The method works on simple estimators as well as on nested objects
 |      (such as pipelines). The former have parameters of the form
 |      ``<component>__<parameter>`` so that it's possible to update each
 |      component of a nested object.
 |      
 |      Returns
 |      -------
 |      self
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.base.BaseEstimator:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.RegressorMixin:
 |  
 |  score(self, X, y, sample_weight=None)
 |      Returns the coefficient of determination R^2 of the prediction.
 |      
 |      The coefficient R^2 is defined as (1 - u/v), where u is the regression
 |      sum of squares ((y_true - y_pred) ** 2).sum() and v is the residual
 |      sum of squares ((y_true - y_true.mean()) ** 2).sum().
 |      Best possible score is 1.0 and it can be negative (because the
 |      model can be arbitrarily worse). A constant model that always
 |      predicts the expected value of y, disregarding the input features,
 |      would get a R^2 score of 0.0.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape = (n_samples, n_features)
 |          Test samples.
 |      
 |      y : array-like, shape = (n_samples) or (n_samples, n_outputs)
 |          True values for X.
 |      
 |      sample_weight : array-like, shape = [n_samples], optional
 |          Sample weights.
 |      
 |      Returns
 |      -------
 |      score : float
 |          R^2 of self.predict(X) wrt. y.

Otherwise, fitting the ridge regression model is exactly the same as fitting a regular least squares model. Note the two extra parameters: the first is the regularization strength (alpha); the second indicates that we do not want the model to fit an intercept, since our feature vector already includes one.

In [22]:
model = linear_model.Ridge(1.0, fit_intercept=False)
model.fit(X, y)
Out[22]:
Ridge(alpha=1.0, copy_X=True, fit_intercept=False, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
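Rather than fixing alpha = 1.0, one would typically compare several regularization strengths. A minimal sketch of such a sweep (for brevity this evaluates on the training set; to actually select alpha one should hold out a validation set):

for a in [0.01, 1.0, 100.0]:
    m = linear_model.Ridge(a, fit_intercept=False)
    m.fit(X, y)
    print(a, m.score(X, y)) # R^2 on the training data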

Again, we can then examine the coefficients learned by our model.

In [23]:
theta = model.coef_
In [24]:
wordWeights = list(zip(theta, words + ['offset']))
wordWeights.sort()
In [25]:
wordWeights[:10]
Out[25]:
[(-1.210460068900258, 'disappointing'),
 (-0.856640197404887, 'disappointed'),
 (-0.7889876776526171, 'unable'),
 (-0.6787442786286616, 'waste'),
 (-0.6621805930969973, 'charged'),
 (-0.5383370441665155, 'supposed'),
 (-0.5275057765332417, 'unfortunately'),
 (-0.49621911813910957, 'tried'),
 (-0.49620102935875465, 'australia'),
 (-0.47713054138625355, 'wont')]
In [26]:
wordWeights[-10:]
Out[26]:
[(0.23572772781658063, 'whats'),
 (0.23820563310858286, 'problems'),
 (0.24343075578690235, 'particular'),
 (0.24673282091840637, 'worry'),
 (0.2530654258503279, 'exelente'),
 (0.25972864184701644, 'excelent'),
 (0.27148796537139874, 'excelente'),
 (0.3094591548207183, 'beat'),
 (0.32586797936409373, 'expire'),
 (4.658102178230557, 'offset')]

Regression diagnostics: MSE and R^2

To evaluate our regressors and classifiers in more detail, below we will introduce several diagnostics for regression and classification.

First we discuss the MSE (which we have already been using), and its relationship to the R^2 statistic.

We start by extracting the predictions from our model:

In [27]:
predictions = model.predict(X)

And computing their squared differences:

In [28]:
differences = [(p-t)**2 for (p,t) in zip(predictions,y)] # avoid shadowing the label list y

The MSE is just the average (Mean) of these squared differences:

In [29]:
MSE = sum(differences) / len(differences)
print("MSE = " + str(MSE))
MSE = 0.4260065431778635

As we saw in the lectures, the R^2 (and the FVU, or "Fraction of Variance Unexplained") normalize the Mean Squared Error based on the variance of the data:

In [30]:
FVU = MSE / numpy.var(y)
R2 = 1 - FVU
print("R2 = " + str(R2))
R2 = 0.38057450510836266
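As an aside, this value can be cross-checked against sklearn's built-in implementation:

from sklearn.metrics import r2_score
r2_score(y, predictions) # should agree with the value above up to floating-point error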

Classification accuracy measures

To look at some classification diagnostics, we first convert our problem into a classification setting. To do so we simply replace our output variable (the star rating) with a binary variable indicating whether the star rating was greater than 3.

We then follow the same procedure as before to generate predictions from our (logistic regression) classifier:

In [31]:
y_class = [(rating > 3) for rating in y]
In [32]:
model = linear_model.LogisticRegression()
model.fit(X, y_class)
Out[32]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [33]:
predictions = model.predict(X)
In [34]:
correct = predictions == y_class

Classification Diagnostics: Accuracy

The first and simplest classifier evaluation metric is the accuracy:

In [35]:
accuracy = sum(correct) / len(correct)
print("Accuracy = " + str(accuracy))
Accuracy = 0.9627932870960385

True positives, false positives, and balanced error rate

To compute more detailed diagnostics, we first compute the number of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).

In [36]:
TP = sum([(p and l) for (p,l) in zip(predictions, y_class)])
FP = sum([(p and not l) for (p,l) in zip(predictions, y_class)])
TN = sum([(not p and not l) for (p,l) in zip(predictions, y_class)])
FN = sum([(not p and l) for (p,l) in zip(predictions, y_class)])
In [37]:
print("TP = " + str(TP))
print("FP = " + str(FP))
print("TN = " + str(TN))
print("FN = " + str(FN))
TP = 138467
FP = 4446
TN = 5072
FN = 1101

From these we can re-compute the accuracy:

In [38]:
(TP + TN) / (TP + FP + TN + FN)
Out[38]:
0.9627932870960385

As well as the true positive rate, and true negative rate:

In [39]:
TPR = TP / (TP + FN)
TNR = TN / (TN + FP)

Finally, we can compute the Balanced Error Rate (BER), which is the average of the false positive and false negative rates. Unlike raw accuracy, it is not inflated when one class (here, positive reviews) dominates the data.

In [40]:
BER = 1 - 1/2 * (TPR + TNR)
print("Balanced error rate = " + str(BER))
Balanced error rate = 0.237501783939573
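As an aside, newer versions of scikit-learn (0.20 and later) expose the complementary quantity directly, which offers a cross-check:

from sklearn.metrics import balanced_accuracy_score
1 - balanced_accuracy_score(y_class, predictions) # equals the BER computed above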

Ranking-based performance: Precision and Recall

Next we can compute ranking-based evaluation measures, like the precision, recall, and F1 scores.

Precision and recall can be defined in terms of the number of true positives, false positives, and false negatives:

In [41]:
precision = TP / (TP + FP)
In [42]:
recall = TP / (TP + FN)
In [43]:
precision, recall
Out[43]:
(0.9688901639458971, 0.9921113722343231)

The F1 score is just the average (precisely, the harmonic mean) of precision and recall. This is useful since it's easy to achieve either good precision or good recall in isolation, but hard for both values to be high simultaneously.

In [44]:
F1 = 2 * (precision*recall) / (precision + recall)
In [45]:
F1
Out[45]:
0.9803632810702313

Using confidence scores for ranking: precision@k and recall@k

All of the models we've seen so far (regression, ridge regression, logistic regression, etc.) are capable of outputting confidence scores along with their predictions. We can use these scores to rank the model's output from most to least confident.

From the documentation, we see that decision_function will generate confidence scores from the model. Essentially, this function simply outputs the value of X·theta (plus the intercept, if one was fit).

In [46]:
help(model)
Help on LogisticRegression in module sklearn.linear_model.logistic object:

class LogisticRegression(sklearn.base.BaseEstimator, sklearn.linear_model.base.LinearClassifierMixin, sklearn.feature_selection.from_model._LearntSelectorMixin, sklearn.linear_model.base.SparseCoefMixin)
 |  Logistic Regression (aka logit, MaxEnt) classifier.
 |  
 |  In the multiclass case, the training algorithm uses the one-vs-rest (OvR)
 |  scheme if the 'multi_class' option is set to 'ovr' and uses the
 |  cross-entropy loss, if the 'multi_class' option is set to 'multinomial'.
 |  (Currently the 'multinomial' option is supported only by the 'lbfgs' and
 |  'newton-cg' solvers.)
 |  
 |  This class implements regularized logistic regression using the
 |  `liblinear` library, newton-cg and lbfgs solvers. It can handle both
 |  dense and sparse input. Use C-ordered arrays or CSR matrices containing
 |  64-bit floats for optimal performance; any other input format will be
 |  converted (and copied).
 |  
 |  The newton-cg and lbfgs solvers support only L2 regularization with primal
 |  formulation. The liblinear solver supports both L1 and L2 regularization,
 |  with a dual formulation only for the L2 penalty.
 |  
 |  Read more in the :ref:`User Guide <logistic_regression>`.
 |  
 |  Parameters
 |  ----------
 |  penalty : str, 'l1' or 'l2'
 |      Used to specify the norm used in the penalization. The newton-cg and
 |      lbfgs solvers support only l2 penalties.
 |  
 |  dual : bool
 |      Dual or primal formulation. Dual formulation is only implemented for
 |      l2 penalty with liblinear solver. Prefer dual=False when
 |      n_samples > n_features.
 |  
 |  C : float, optional (default=1.0)
 |      Inverse of regularization strength; must be a positive float.
 |      Like in support vector machines, smaller values specify stronger
 |      regularization.
 |  
 |  fit_intercept : bool, default: True
 |      Specifies if a constant (a.k.a. bias or intercept) should be
 |      added to the decision function.
 |  
 |  intercept_scaling : float, default: 1
 |      Useful only if solver is liblinear.
 |      when self.fit_intercept is True, instance vector x becomes
 |      [x, self.intercept_scaling],
 |      i.e. a "synthetic" feature with constant value equals to
 |      intercept_scaling is appended to the instance vector.
 |      The intercept becomes intercept_scaling * synthetic feature weight
 |      Note! the synthetic feature weight is subject to l1/l2 regularization
 |      as all other features.
 |      To lessen the effect of regularization on synthetic feature weight
 |      (and therefore on the intercept) intercept_scaling has to be increased.
 |  
 |  class_weight : dict or 'balanced', optional
 |      Weights associated with classes in the form ``{class_label: weight}``.
 |      If not given, all classes are supposed to have weight one.
 |  
 |      The "balanced" mode uses the values of y to automatically adjust
 |      weights inversely proportional to class frequencies in the input data
 |      as ``n_samples / (n_classes * np.bincount(y))``
 |  
 |      Note that these weights will be multiplied with sample_weight (passed
 |      through the fit method) if sample_weight is specified.
 |  
 |      .. versionadded:: 0.17
 |         *class_weight='balanced'* instead of deprecated *class_weight='auto'*.
 |  
 |  max_iter : int
 |      Useful only for the newton-cg, sag and lbfgs solvers.
 |      Maximum number of iterations taken for the solvers to converge.
 |  
 |  random_state : int seed, RandomState instance, or None (default)
 |      The seed of the pseudo random number generator to use when
 |      shuffling the data.
 |  
 |  solver : {'newton-cg', 'lbfgs', 'liblinear', 'sag'}
 |      Algorithm to use in the optimization problem.
 |  
 |      - For small datasets, 'liblinear' is a good choice, whereas 'sag' is
 |          faster for large ones.
 |      - For multiclass problems, only 'newton-cg' and 'lbfgs' handle
 |          multinomial loss; 'sag' and 'liblinear' are limited to
 |          one-versus-rest schemes.
 |      - 'newton-cg', 'lbfgs' and 'sag' only handle L2 penalty.
 |  
 |      Note that 'sag' fast convergence is only guaranteed on features with
 |      approximately the same scale. You can preprocess the data with a
 |      scaler from sklearn.preprocessing.
 |  
 |      .. versionadded:: 0.17
 |         Stochastic Average Gradient descent solver.
 |  
 |  tol : float, optional
 |      Tolerance for stopping criteria.
 |  
 |  multi_class : str, {'ovr', 'multinomial'}
 |      Multiclass option can be either 'ovr' or 'multinomial'. If the option
 |      chosen is 'ovr', then a binary problem is fit for each label. Else
 |      the loss minimised is the multinomial loss fit across
 |      the entire probability distribution. Works only for the 'lbfgs'
 |      solver.
 |  
 |  verbose : int
 |      For the liblinear and lbfgs solvers set verbose to any positive
 |      number for verbosity.
 |  
 |  warm_start : bool, optional
 |      When set to True, reuse the solution of the previous call to fit as
 |      initialization, otherwise, just erase the previous solution.
 |      Useless for liblinear solver.
 |  
 |      .. versionadded:: 0.17
 |         *warm_start* to support *lbfgs*, *newton-cg*, *sag* solvers.
 |  
 |  n_jobs : int, optional
 |      Number of CPU cores used during the cross-validation loop. If given
 |      a value of -1, all cores are used.
 |  
 |  Attributes
 |  ----------
 |  coef_ : array, shape (n_classes, n_features)
 |      Coefficient of the features in the decision function.
 |  
 |  intercept_ : array, shape (n_classes,)
 |      Intercept (a.k.a. bias) added to the decision function.
 |      If `fit_intercept` is set to False, the intercept is set to zero.
 |  
 |  n_iter_ : array, shape (n_classes,) or (1, )
 |      Actual number of iterations for all classes. If binary or multinomial,
 |      it returns only 1 element. For liblinear solver, only the maximum
 |      number of iteration across all classes is given.
 |  
 |  See also
 |  --------
 |  SGDClassifier : incrementally trained logistic regression (when given
 |      the parameter ``loss="log"``).
 |  sklearn.svm.LinearSVC : learns SVM models using the same algorithm.
 |  
 |  Notes
 |  -----
 |  The underlying C implementation uses a random number generator to
 |  select features when fitting the model. It is thus not uncommon,
 |  to have slightly different results for the same input data. If
 |  that happens, try with a smaller tol parameter.
 |  
 |  Predict output may not match that of standalone liblinear in certain
 |  cases. See :ref:`differences from liblinear <liblinear_differences>`
 |  in the narrative documentation.
 |  
 |  References
 |  ----------
 |  
 |  LIBLINEAR -- A Library for Large Linear Classification
 |      http://www.csie.ntu.edu.tw/~cjlin/liblinear/
 |  
 |  Hsiang-Fu Yu, Fang-Lan Huang, Chih-Jen Lin (2011). Dual coordinate descent
 |      methods for logistic regression and maximum entropy models.
 |      Machine Learning 85(1-2):41-75.
 |      http://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf
 |  
 |  Method resolution order:
 |      LogisticRegression
 |      sklearn.base.BaseEstimator
 |      sklearn.linear_model.base.LinearClassifierMixin
 |      sklearn.base.ClassifierMixin
 |      sklearn.feature_selection.from_model._LearntSelectorMixin
 |      sklearn.base.TransformerMixin
 |      sklearn.linear_model.base.SparseCoefMixin
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='liblinear', max_iter=100, multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)
 |  
 |  fit(self, X, y, sample_weight=None)
 |      Fit the model according to the given training data.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix}, shape (n_samples, n_features)
 |          Training vector, where n_samples in the number of samples and
 |          n_features is the number of features.
 |      
 |      y : array-like, shape (n_samples,)
 |          Target vector relative to X.
 |      
 |      sample_weight : array-like, shape (n_samples,) optional
 |          Array of weights that are assigned to individual samples.
 |          If not provided, then each sample is given unit weight.
 |      
 |          .. versionadded:: 0.17
 |             *sample_weight* support to LogisticRegression.
 |      
 |      Returns
 |      -------
 |      self : object
 |          Returns self.
 |  
 |  predict_log_proba(self, X)
 |      Log of probability estimates.
 |      
 |      The returned estimates for all classes are ordered by the
 |      label of classes.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape = [n_samples, n_features]
 |      
 |      Returns
 |      -------
 |      T : array-like, shape = [n_samples, n_classes]
 |          Returns the log-probability of the sample for each class in the
 |          model, where classes are ordered as they are in ``self.classes_``.
 |  
 |  predict_proba(self, X)
 |      Probability estimates.
 |      
 |      The returned estimates for all classes are ordered by the
 |      label of classes.
 |      
 |      For a multi_class problem, if multi_class is set to be "multinomial"
 |      the softmax function is used to find the predicted probability of
 |      each class.
 |      Else use a one-vs-rest approach, i.e calculate the probability
 |      of each class assuming it to be positive using the logistic function.
 |      and normalize these values across all the classes.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape = [n_samples, n_features]
 |      
 |      Returns
 |      -------
 |      T : array-like, shape = [n_samples, n_classes]
 |          Returns the probability of the sample for each class in the model,
 |          where classes are ordered as they are in ``self.classes_``.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.BaseEstimator:
 |  
 |  __repr__(self)
 |  
 |  get_params(self, deep=True)
 |      Get parameters for this estimator.
 |      
 |      Parameters
 |      ----------
 |      deep: boolean, optional
 |          If True, will return the parameters for this estimator and
 |          contained subobjects that are estimators.
 |      
 |      Returns
 |      -------
 |      params : mapping of string to any
 |          Parameter names mapped to their values.
 |  
 |  set_params(self, **params)
 |      Set the parameters of this estimator.
 |      
 |      The method works on simple estimators as well as on nested objects
 |      (such as pipelines). The former have parameters of the form
 |      ``<component>__<parameter>`` so that it's possible to update each
 |      component of a nested object.
 |      
 |      Returns
 |      -------
 |      self
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.base.BaseEstimator:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.linear_model.base.LinearClassifierMixin:
 |  
 |  decision_function(self, X)
 |      Predict confidence scores for samples.
 |      
 |      The confidence score for a sample is the signed distance of that
 |      sample to the hyperplane.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix}, shape = (n_samples, n_features)
 |          Samples.
 |      
 |      Returns
 |      -------
 |      array, shape=(n_samples,) if n_classes == 2 else (n_samples, n_classes)
 |          Confidence scores per (sample, class) combination. In the binary
 |          case, confidence score for self.classes_[1] where >0 means this
 |          class would be predicted.
 |  
 |  predict(self, X)
 |      Predict class labels for samples in X.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix}, shape = [n_samples, n_features]
 |          Samples.
 |      
 |      Returns
 |      -------
 |      C : array, shape = [n_samples]
 |          Predicted class label per sample.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.ClassifierMixin:
 |  
 |  score(self, X, y, sample_weight=None)
 |      Returns the mean accuracy on the given test data and labels.
 |      
 |      In multi-label classification, this is the subset accuracy
 |      which is a harsh metric since you require for each sample that
 |      each label set be correctly predicted.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape = (n_samples, n_features)
 |          Test samples.
 |      
 |      y : array-like, shape = (n_samples) or (n_samples, n_outputs)
 |          True labels for X.
 |      
 |      sample_weight : array-like, shape = [n_samples], optional
 |          Sample weights.
 |      
 |      Returns
 |      -------
 |      score : float
 |          Mean accuracy of self.predict(X) wrt. y.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.feature_selection.from_model._LearntSelectorMixin:
 |  
 |  transform(*args, **kwargs)
 |      DEPRECATED: Support to use estimators as feature selectors will be removed in version 0.19. Use SelectFromModel instead.
 |      
 |      Reduce X to its most important features.
 |      
 |              Uses ``coef_`` or ``feature_importances_`` to determine the most
 |              important features.  For models with a ``coef_`` for each class, the
 |              absolute sum over the classes is used.
 |      
 |              Parameters
 |              ----------
 |              X : array or scipy sparse matrix of shape [n_samples, n_features]
 |                  The input samples.
 |      
 |              threshold : string, float or None, optional (default=None)
 |                  The threshold value to use for feature selection. Features whose
 |                  importance is greater or equal are kept while the others are
 |                  discarded. If "median" (resp. "mean"), then the threshold value is
 |                  the median (resp. the mean) of the feature importances. A scaling
 |                  factor (e.g., "1.25*mean") may also be used. If None and if
 |                  available, the object attribute ``threshold`` is used. Otherwise,
 |                  "mean" is used by default.
 |      
 |              Returns
 |              -------
 |              X_r : array of shape [n_samples, n_selected_features]
 |                  The input samples with only the selected features.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.TransformerMixin:
 |  
 |  fit_transform(self, X, y=None, **fit_params)
 |      Fit to data, then transform it.
 |      
 |      Fits transformer to X and y with optional parameters fit_params
 |      and returns a transformed version of X.
 |      
 |      Parameters
 |      ----------
 |      X : numpy array of shape [n_samples, n_features]
 |          Training set.
 |      
 |      y : numpy array of shape [n_samples]
 |          Target values.
 |      
 |      Returns
 |      -------
 |      X_new : numpy array of shape [n_samples, n_features_new]
 |          Transformed array.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.linear_model.base.SparseCoefMixin:
 |  
 |  densify(self)
 |      Convert coefficient matrix to dense array format.
 |      
 |      Converts the ``coef_`` member (back) to a numpy.ndarray. This is the
 |      default format of ``coef_`` and is required for fitting, so calling
 |      this method is only required on models that have previously been
 |      sparsified; otherwise, it is a no-op.
 |      
 |      Returns
 |      -------
 |      self: estimator
 |  
 |  sparsify(self)
 |      Convert coefficient matrix to sparse format.
 |      
 |      Converts the ``coef_`` member to a scipy.sparse matrix, which for
 |      L1-regularized models can be much more memory- and storage-efficient
 |      than the usual numpy.ndarray representation.
 |      
 |      The ``intercept_`` member is not converted.
 |      
 |      Notes
 |      -----
 |      For non-sparse models, i.e. when there are not many zeros in ``coef_``,
 |      this may actually *increase* memory usage, so use this method with
 |      care. A rule of thumb is that the number of zero elements, which can
 |      be computed with ``(coef_ == 0).sum()``, must be more than 50% for this
 |      to provide significant benefits.
 |      
 |      After calling this method, further fitting with the partial_fit
 |      method (if any) will not work until you call densify.
 |      
 |      Returns
 |      -------
 |      self: estimator

In [47]:
confidences = model.decision_function(X)
In [48]:
confidences
Out[48]:
array([3.04132458, 5.88211478, 4.01947879, ..., 7.03240358, 7.02408501,
       0.28614467])
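Since the classifier is linear, these scores are just a linear function of the features, which we can verify directly from the fitted coefficients:

scores = numpy.dot(X, model.coef_[0]) + model.intercept_[0]
numpy.allclose(scores, confidences) # should be True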

In particular, we are interested in whether the positive labels are assigned high confidence (i.e., positive instances should appear near the top of the ranking). To examine this, we pair each label with its confidence score:

In [49]:
confidencesAndLabels = list(zip(confidences,y_class))
In [50]:
confidencesAndLabels
Out[50]:
[(3.0413245811865908, True),
 (5.882114780538816, True),
 (4.019478790021274, True),
 (0.21425627264410307, True),
 (4.509963266551246, True),
 (3.858252290469597, True),
 (7.23750702604945, True),
 (8.668115697370716, True),
 (7.566571159616022, True),
 (6.716375774888643, True),
 (4.803110258239213, True),
 (3.8594051302696073, True),
 (0.6423382288602191, True),
 (3.8583119674689614, True),
 (3.4464535711136737, True),
 (0.9994430814394627, True),
 (3.2014133429952816, True),
 (5.590657019759745, True),
 (3.2014133429952816, False),
 (4.794015611878226, True),
 (5.094234254310385, True),
 (4.501355816637814, True),
 (11.611551343451342, True),
 (10.53689983416234, True),
 (4.908506753542062, True),
 (7.898490943033232, True),
 (6.4750526850112, True),
 (3.3250847337883007, True),
 (3.2014133429952816, True),
 (4.820881775931565, True),
 (4.6194866437602995, True),
 (6.759865398702967, True),
 (2.331854397520121, True),
 (2.568357911334152, True),
 (6.991013935821071, True),
 (6.514203169500977, True),
 (0.027168397382675735, False),
 (6.19551058546523, True),
 (7.05770888907893, True),
 (5.13404115847007, True),
 (11.996011459099066, True),
 (4.921415433358495, True),
 (3.643593772336142, True),
 (3.9799713465831106, True),
 (7.423230212084493, True),
 (8.588683315728863, True),
 (6.983257966304148, True),
 (6.881470843910153, True),
 (2.331854397520121, True),
 (7.413583802759575, True),
 (2.85739973974578, True),
 (2.174896699469438, True),
 (5.440178679774148, True),
 (3.5781553293489043, True),
 (4.187354626623753, True),
 (9.76659204102196, True),
 (4.45378042950138, True),
 (4.820881775931565, True),
 (4.949892364591952, True),
 (5.2721031895905845, True),
 (2.994091384494462, True),
 (6.491555644356665, True),
 (6.041218860206764, True),
 (3.654155496043346, True),
 (1.8294137967806434, True),
 (4.820881775931565, True),
 (5.357383901704399, True),
 (-3.47728259119952, False),
 (5.025458278575022, True),
 (5.7099706079230454, True),
 (0.7965557948349573, True),
 (2.6699452516920115, True),
 (5.889201491385315, True),
 (6.023464109505108, True),
 (2.082351854431661, True),
 (-1.6742348686212394, False),
 (4.820881775931565, True),
 (4.921415433358495, True),
 (5.931579860072395, True),
 (4.659074242836584, True),
 (3.7319989820105866, True),
 (2.272698078259345, True),
 (4.066348597280217, True),
 (4.820881775931565, True),
 (3.8934451253292766, True),
 (0.9475633531578446, False),
 (4.881319450227488, True),
 (3.5129705721171627, True),
 (2.331854397520121, True),
 (4.845544511155318, True),
 (3.53399156056855, True),
 (5.047456797625842, True),
 (5.92280851918551, True),
 (2.5038409503408374, True),
 (4.439429806070564, True),
 (10.006761125931769, True),
 (5.152510373779086, True),
 (-1.2321577499361103, False),
 (8.111003990543749, True),
 (8.215611038826145, True),
 (3.4115163259940697, True),
 (2.2594135097748937, True),
 (3.509913789288311, True),
 (6.51932668912424, True),
 (5.586476020913496, True),
 (5.842085538602182, True),
 (4.6318920217553945, True),
 (4.921415433358495, True),
 (3.128391682776514, True),
 (3.645273290215897, True),
 (3.253077215453639, True),
 (4.803110258239213, True),
 (2.331854397520121, True),
 (2.1990919753006843, True),
 (4.486632973156167, True),
 (5.946695350009215, True),
 (2.8619939755148005, True),
 (5.1512095732241745, True),
 (3.2120600459836135, True),
 (2.331854397520121, True),
 (3.5714615436312434, True),
 (5.5605751916043085, True),
 (5.285162334241612, True),
 (4.820881775931565, True),
 (4.180441377319466, True),
 (6.108352514913858, True),
 (4.471405994642479, True),
 (6.667352265223963, True),
 (8.98738029411534, True),
 (4.305060701354591, True),
 (5.012756371647175, True),
 (1.7583076472763477, True),
 (9.080316589384223, True),
 (7.010931357456087, True),
 (3.2966500380974804, True),
 (6.052327469454112, True),
 (2.6919756978511593, True),
 (3.1383532226812934, True),
 (3.3046335600617196, True),
 (8.575613864127348, True),
 (4.283524441342786, True),
 (7.038598524102502, True),
 (4.747642256586034, True),
 (6.242225031190601, True),
 (1.9790041599610702, True),
 (7.430866891520247, True),
 (3.3528116161173998, True),
 (5.308188525200066, True),
 (3.265022942687331, True),
 (4.550699852327108, True),
 (4.820881775931565, True),
 (-2.729212011996508, False),
 (10.639229798834029, True),
 (2.331854397520121, True),
 (3.1534950103262815, True),
 (5.557188997056038, True),
 (4.820881775931565, True),
 (7.467815206351767, True),
 (2.099556876453362, True),
 (4.820881775931565, True),
 (4.7096443870469455, True),
 (3.6647670323261465, True),
 (7.8860174122606015, True),
 (4.391516992594831, True),
 (3.53399156056855, True),
 (4.354735303943411, True),
 (6.95277407305322, True),
 (4.319160466570116, True),
 (4.313303892055009, True),
 (5.048924837804898, True),
 (9.781868169076423, True),
 (6.632962016283561, True),
 (2.331854397520121, True),
 (0.8511528928521724, False),
 (2.0405610037215225, True),
 (4.081661856832745, True),
 (10.19009264494994, True),
 (6.380303224337287, True),
 (1.64582566320306, False),
 (4.753270075617059, True),
 (0.36083854764529477, False),
 (7.0763720194100825, True),
 (4.2119018739005005, True),
 (4.489170439254736, True),
 (9.420579168105581, True),
 (6.693658446229863, True),
 (6.492686080585503, True),
 (8.650816880266868, True),
 (3.2014133429952816, True),
 (-0.49265348356761707, False),
 (2.9627129342118064, True),
 (8.719956000323927, True),
 (5.89489728824829, True),
 (9.956603912233152, True),
 (4.768407001324703, True),
 (4.832466417973228, True),
 (5.402465050847325, True),
 (5.868990495174575, True),
 (5.078014968936599, True),
 (4.36149915084678, True),
 (5.895097336987788, True),
 (3.540878694294449, True),
 (3.917731684864145, True),
 (10.904489629097517, True),
 (6.483525358349936, True),
 (4.985144474247337, True),
 (3.176705434445399, True),
 (4.820881775931565, True),
 (4.493791645660476, True),
 (3.8265649175966505, True),
 (3.2501729860156807, True),
 (3.930693801063745, True),
 (1.9653504362708794, False),
 (2.331854397520121, True),
 (9.559058868912006, True),
 (4.820881775931565, True),
 (9.049808382913508, True),
 (8.285246798652341, True),
 (3.2774850209596496, True),
 (9.142385899334752, True),
 (4.203758214337603, True),
 (4.3919720092245935, True),
 (7.087397139159019, True),
 (7.785826309731473, True),
 (5.912211326270311, True),
 (4.183026189232396, True),
 (3.8662297289101843, True),
 (9.215683295251953, True),
 (10.068066759443711, True),
 (7.917270688493638, True),
 (1.3634399434218696, True),
 (5.502705449366108, True),
 (2.4685970710662284, True),
 (1.99981184448273, True),
 (5.47434183956771, True),
 (1.0981872827763364, True),
 (2.4967345466974358, True),
 (2.578382987311136, True),
 (6.951110683089977, True),
 (5.558641746562831, True),
 (4.137431352435629, True),
 (3.9373196177961187, True),
 (-0.46220554399795244, False),
 (5.2151073376410215, True),
 (3.0596717935722024, True),
 (4.820881775931565, True),
 (9.151676910406023, True),
 (4.667124415055057, True),
 (10.72205578121478, True),
 (4.387422939641154, True),
 (5.182159590177556, True),
 (3.893023776314262, True),
 (5.946051242851249, True),
 (3.3105788157125824, True),
 (3.5889187689501867, True),
 (1.4358714583258965, True),
 (6.4448997402862895, True),
 (3.637632957317088, True),
 (5.687848201785986, True),
 (2.190918886884543, True),
 (2.9627129342118064, True),
 (3.6791439526442433, True),
 (7.206028319841418, True),
 (4.01128917806531, True),
 (4.15525678721438, True),
 (2.055925930473155, True),
 (1.4234969728760367, False),
 (3.859723087300117, True),
 (7.300897103348177, True),
 (5.219110357147408, True),
 (2.2711308645590575, True),
 (1.3197303574303727, False),
 (4.915444863542067, True),
 (2.574460954948118, True),
 (5.242314181173989, True),
 (4.670251589280255, True),
 (3.415816101753637, True),
 (2.9859441761584153, True),
 (5.890120211616491, True),
 (7.331828491054494, True),
 (5.751082133240478, True),
 (1.884976415610879, True),
 (3.268418160939915, True),
 (4.28495121224829, True),
 (4.485472599372089, True),
 (6.298724246185491, True),
 (-5.014829537600713, False),
 (5.553956758197767, True),
 (1.9276137453689135, True),
 (1.5688520768256153, True),
 (-3.136326368255018, False),
 (2.851971900870141, True),
 (3.791226934032812, True),
 (9.724101193489636, True),
 (15.122391316368779, True),
 (2.7683618226069067, True),
 (0.37596290058211734, True),
 (4.294789337348786, True),
 (4.515170924688268, True),
 (5.208669689794778, True),
 (4.744388028027894, True),
 (3.2610056147843336, True),
 (4.514438268045216, True),
 (4.366792707718403, True),
 (-1.5176626764714538, False),
 (3.8583119674689614, True),
 (4.820881775931565, True),
 (4.818899562696924, True),
 (5.046175043801265, True),
 (8.320489724801021, True),
 (5.095591483818575, True),
 (9.891830082295328, True),
 (3.3476708276453406, True),
 (3.550327170163273, True),
 (9.104893196819061, True),
 (7.386912910667062, True),
 (12.885832697153933, True),
 (6.063701134353668, True),
 (-2.032057608806088, False),
 (7.458563369618983, True),
 (1.3619985530210656, True),
 (5.351381266095958, True),
 (5.771498923510298, True),
 (4.991447868019162, True),
 (4.262999033426551, True),
 (4.579334533432945, True),
 (8.33577128014252, True),
 (4.95081659351958, True),
 (-8.173547134639124, False),
 (4.803110258239213, True),
 (4.487758793737053, True),
 (5.0671817293146635, True),
 (5.281406712869599, True),
 (8.344446691898634, True),
 (2.3827504620409865, True),
 (5.983316213201307, True),
 (3.2014133429952816, True),
 (8.656994802036387, True),
 (2.4700883116686794, True),
 (3.149611176313128, True),
 (7.363912965714289, True),
 (7.676230817573933, True),
 (3.9283420576665122, True),
 (7.545785662422247, True),
 (2.254024780483211, True),
 (8.464970951550248, True),
 (6.7356391570231615, True),
 (6.40882857967169, True),
 (4.696082633535264, True),
 (3.26353040543093, True),
 (6.032831255209788, True),
 (6.718806129093905, True),
 (5.328838901307763, True),
 (5.028241546364799, True),
 (13.41926984968299, True),
 (4.97273936115122, True),
 (5.031503434214946, True),
 (6.208339004929012, True),
 (3.791440970057468, True),
 (10.708033038033628, True),
 (7.8843048969033, True),
 (3.6910553205179353, True),
 (6.343549044760943, True),
 (7.677101567970379, True),
 (7.667645221559027, True),
 (3.6010630646511554, True),
 (4.820881775931565, True),
 (5.8586158849873495, True),
 (4.805719531718678, True),
 (3.741954286346159, True),
 (-2.4865786977700095, False),
 (7.161666396825592, True),
 (3.9412606612871253, True),
 (7.492669328607702, True),
 (-2.8110707037877107, False),
 (6.143699048376176, True),
 (5.78037000436523, True),
 (4.328025001763827, True),
 (1.3044928208625484, True),
 (7.685489314306048, True),
 (5.958781819309055, True),
 (6.320669662064875, True),
 (4.9517271706552854, True),
 (4.418543003723321, True),
 (4.987670705598773, True),
 (3.7879949759657574, True),
 (-0.20444036621941897, False),
 (0.16270475868077883, True),
 (6.5540752776154845, True),
 (1.5409869040100248, False),
 (4.921415433358495, True),
 (4.751437399382636, True),
 (2.2759450886466555, True),
 (5.038920460712644, True),
 (3.9344956467615635, True),
 (2.7692294004564655, True),
 (2.1864485673429517, False),
 (3.53399156056855, True),
 (6.238245378029504, True),
 (4.963038724286984, True),
 (7.19991818647287, True),
 (4.881254112813989, True),
 (4.359274313703038, True),
 (3.917731684864145, True),
 (2.1871186488125067, True),
 (4.283524441342786, True),
 (9.836192689706145, True),
 (3.8271588563762666, True),
 (8.06293237049155, True),
 (4.744388028027894, True),
 (4.921415433358495, True),
 (2.461720499247588, True),
 (4.507939756985895, True),
 (9.262052971050245, True),
 (-6.109222970691342, False),
 (5.497662749310878, True),
 (7.625744487041248, True),
 (3.360052704241256, True),
 (5.304503173842215, True),
 (4.966613547408576, True),
 (3.53399156056855, True),
 (2.5842594793864295, True),
 (5.28073918755069, True),
 (5.737439342503901, True),
 (4.285292963213946, True),
 (4.998820258008979, True),
 (4.744388028027894, True),
 (8.401931778423576, True),
 (3.511715169972211, True),
 (6.762816578381045, True),
 (-6.684515598000174, False),
 (3.623737522734845, True),
 (8.294809869059714, True),
 (5.871114451344754, True),
 (3.6989662879088474, True),
 (-7.358610304414043, False),
 (-0.545435545509559, False),
 (4.775066940249132, True),
 (7.195447958838868, False),
 (2.331854397520121, True),
 (19.9704405345226, True),
 (4.820881775931565, True),
 (4.064230415979768, True),
 (4.390380293157214, True),
 (0.42939871155866605, True),
 (2.267928499216898, True),
 (7.111801723083685, True),
 (5.880877159156641, True),
 (4.334139080054571, True),
 (6.464402379588416, True),
 (4.66964343281139, True),
 (9.454877581106674, True),
 (5.8534198209055575, True),
 (2.0481409133700446, True),
 (4.051587730014681, True),
 (4.744388028027894, True),
 (4.19920348505575, True),
 (4.605645335013199, True),
 (4.665998314795157, True),
 (1.0361327350883442, False),
 (6.3613734088213185, True),
 (5.689932859907012, True),
 (3.1904453876549326, True),
 (6.94286363621976, True),
 (1.7015848800387374, True),
 (7.656670380809159, True),
 (9.509562700346125, True),
 (6.24329943298098, True),
 (2.747652930634767, True),
 (5.877316010899383, True),
 (5.741616006621302, True),
 (4.841672571769557, True),
 (4.667124415055057, True),
 (7.692601368850825, True),
 (3.2014133429952816, True),
 (5.114117581623724, True),
 (2.824103890376763, True),
 (9.36813803878609, True),
 (4.021901699520775, True),
 (4.066348597280217, True),
 (2.4804713099361932, True),
 (3.8087050808403973, True),
 (4.180441377319466, True),
 (3.53399156056855, True),
 (2.8795755558643377, True),
 (2.6268682083771546, True),
 (6.181197355285973, True),
 (3.411090710481294, True),
 (3.0946426061893604, True),
 (3.32242979209163, True),
 (7.139865570022897, True),
 (2.9945504461768078, True),
 (4.049655243775111, True),
 (4.350867684510403, True),
 (7.634745809891241, True),
 (4.283524441342786, True),
 (3.2014133429952816, True),
 (2.875783144711762, True),
 (3.0437838792265417, True),
 (6.629203932564463, True),
 (2.9258202746339323, True),
 (5.074875517753984, True),
 (5.322192772648609, True),
 (3.3666333598049603, True),
 (6.410133044984902, True),
 (3.2014133429952816, True),
 (3.903534818691994, True),
 (3.2014133429952816, True),
 (4.744388028027894, True),
 (3.8727562532701985, True),
 (5.532811257637985, True),
 (6.111656624012058, True),
 (2.331854397520121, True),
 (-0.6607977091294945, True),
 (2.1918268696707246, True),
 (-2.739070672219996, False),
 ...]
In [51]:
confidencesAndLabels.sort()
confidencesAndLabels.reverse() # i.e., rank from most to least confident
In [52]:
confidencesAndLabels
Out[52]:
[(36.799763631241866, True),
 (34.979498841095634, True),
 (33.25376124776206, True),
 (33.115088841758435, True),
 (31.534399610309706, True),
 (30.776228313229424, True),
 (28.90047363954259, True),
 (28.24222836301973, True),
 (27.864643511532236, True),
 (26.73927967862375, True),
 (25.79878608902182, True),
 (25.715774519423952, True),
 ...]

Once we've sorted by confidence, we can discard the confidences themselves - only the resulting order of the labels matters for our evaluation metrics.

In [53]:
labelsRankedByConfidence = [z[1] for z in confidencesAndLabels]
In [54]:
labelsRankedByConfidence
Out[54]:
[True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 ...]

Precision@K

Precision@K measures what fraction of the top K entries in our ranked list actually have positive labels:

In [55]:
def precisionAtK(K, y_sorted):
    # Fraction of the top K entries that are positive (True counts as 1)
    return sum(y_sorted[:K]) / K

Recall@K

Recall@K measures what fraction of all positively labeled instances appear among the top K:

In [56]:
def recallAtK(K, y_sorted):
    # Fraction of all positive entries that appear in the top K
    return sum(y_sorted[:K]) / sum(y_sorted)
In [57]:
precisionAtK(50, labelsRankedByConfidence)
Out[57]:
1.0
In [58]:
precisionAtK(1000, labelsRankedByConfidence)
Out[58]:
1.0
In [59]:
precisionAtK(10000, labelsRankedByConfidence)
Out[59]:
0.998
In [60]:
recallAtK(50, labelsRankedByConfidence)
Out[60]:
0.0003582483090679812
In [61]:
recallAtK(1000, labelsRankedByConfidence)
Out[61]:
0.007164966181359624
In [62]:
recallAtK(10000, labelsRankedByConfidence)
Out[62]:
0.07150636248996904
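
Note that precision stays near 1 even for large K, while recall grows only slowly. This is because the positive class dominates: from the numbers above (precision ≈ 0.998 at K = 10,000, recall ≈ 0.0715), roughly 140,000 of the ~149,000 reviews carry a positive label, so covering them all requires a very large K. To see the tradeoff directly, a small illustrative loop (not one of the original cells) can sweep over several values of K:

for K in [50, 1000, 10000, 50000]:
    print(K, precisionAtK(K, labelsRankedByConfidence), recallAtK(K, labelsRankedByConfidence))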

Training / validation / test pipeline

To combine our ideas about training, evaluation, regularization, and overfitting, we'll next try to implement a complete training, validation, and testing pipeline.

We start by importing libraries and reading in our data, just as before:

In [63]:
import gzip
from collections import defaultdict
import string
import random
In [64]:
path = "/home/jmcauley/datasets/mooc/amazon/amazon_reviews_us_Gift_Card_v1_00.tsv.gz"
In [65]:
f = gzip.open(path, 'rt', encoding="utf8")
In [66]:
header = f.readline()
header = header.strip().split('\t')
In [67]:
dataset = []
In [68]:
for line in f:
    fields = line.strip().split('\t')
    d = dict(zip(header, fields))
    d['star_rating'] = int(d['star_rating'])
    d['helpful_votes'] = int(d['helpful_votes'])
    d['total_votes'] = int(d['total_votes'])
    dataset.append(d)

And again we build word features based on the 1,000 most common words.

In [69]:
wordCount = defaultdict(int)
punctuation = set(string.punctuation)

for d in dataset:
  r = ''.join([c for c in d['review_body'].lower() if not c in punctuation])
  for w in r.split():
    wordCount[w] += 1

counts = [(wordCount[w], w) for w in wordCount]
counts.sort()
counts.reverse()

words = [x[1] for x in counts[:1000]]

wordId = dict(zip(words, range(len(words))))
wordSet = set(words)
In [70]:
def feature(datum):
    feat = [0]*len(words)
    r = ''.join([c for c in datum['review_body'].lower() if not c in punctuation])
    for w in r.split():
        if w in wordSet: # set lookup is much faster than searching the list
            feat[wordId[w]] += 1
    feat.append(1) #offset
    return feat
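
As a quick sanity check (illustrative, not one of the original cells), each feature vector should have 1,001 entries - one count for each of the 1,000 dictionary words, plus the constant offset:

len(feature(dataset[0])) # expect 1001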

Building the validation set

Again we split our data, but this time we split it into three parts - a training, a validation, and a test component.

In [71]:
random.shuffle(dataset)
In [72]:
X = [feature(d) for d in dataset]
In [73]:
y = [d['star_rating'] for d in dataset]

In this example we use half of our data (and labels) for training, the next quarter for validation, and the final quarter for testing.

In [74]:
N = len(X)
X_train = X[:N//2]
X_valid = X[N//2:3*N//4]
X_test = X[3*N//4:]
y_train = y[:N//2]
y_valid = y[N//2:3*N//4]
y_test = y[3*N//4:]
In [75]:
len(X), len(X_train), len(X_valid), len(X_test)
Out[75]:
(149086, 74543, 37271, 37272)

Again we'll train a model based on regularized (Ridge) regression.

In [76]:
from sklearn import linear_model
In [77]:
help(linear_model.Ridge)
Help on class Ridge in module sklearn.linear_model.ridge:

class Ridge(_BaseRidge, sklearn.base.RegressorMixin)
 |  Linear least squares with l2 regularization.
 |  
 |  This model solves a regression model where the loss function is
 |  the linear least squares function and regularization is given by
 |  the l2-norm. Also known as Ridge Regression or Tikhonov regularization.
 |  This estimator has built-in support for multi-variate regression
 |  (i.e., when y is a 2d-array of shape [n_samples, n_targets]).
 |  
 |  Read more in the :ref:`User Guide <ridge_regression>`.
 |  
 |  Parameters
 |  ----------
 |  alpha : {float, array-like}, shape (n_targets)
 |      Small positive values of alpha improve the conditioning of the problem
 |      and reduce the variance of the estimates.  Alpha corresponds to
 |      ``C^-1`` in other linear models such as LogisticRegression or
 |      LinearSVC. If an array is passed, penalties are assumed to be specific
 |      to the targets. Hence they must correspond in number.
 |  
 |  copy_X : boolean, optional, default True
 |      If True, X will be copied; else, it may be overwritten.
 |  
 |  fit_intercept : boolean
 |      Whether to calculate the intercept for this model. If set
 |      to false, no intercept will be used in calculations
 |      (e.g. data is expected to be already centered).
 |  
 |  max_iter : int, optional
 |      Maximum number of iterations for conjugate gradient solver.
 |      For 'sparse_cg' and 'lsqr' solvers, the default value is determined
 |      by scipy.sparse.linalg. For 'sag' solver, the default value is 1000.
 |  
 |  normalize : boolean, optional, default False
 |      If True, the regressors X will be normalized before regression.
 |  
 |  solver : {'auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag'}
 |      Solver to use in the computational routines:
 |  
 |      - 'auto' chooses the solver automatically based on the type of data.
 |  
 |      - 'svd' uses a Singular Value Decomposition of X to compute the Ridge
 |        coefficients. More stable for singular matrices than
 |        'cholesky'.
 |  
 |      - 'cholesky' uses the standard scipy.linalg.solve function to
 |        obtain a closed-form solution.
 |  
 |      - 'sparse_cg' uses the conjugate gradient solver as found in
 |        scipy.sparse.linalg.cg. As an iterative algorithm, this solver is
 |        more appropriate than 'cholesky' for large-scale data
 |        (possibility to set `tol` and `max_iter`).
 |  
 |      - 'lsqr' uses the dedicated regularized least-squares routine
 |        scipy.sparse.linalg.lsqr. It is the fatest but may not be available
 |        in old scipy versions. It also uses an iterative procedure.
 |  
 |      - 'sag' uses a Stochastic Average Gradient descent. It also uses an
 |        iterative procedure, and is often faster than other solvers when
 |        both n_samples and n_features are large. Note that 'sag' fast
 |        convergence is only guaranteed on features with approximately the
 |        same scale. You can preprocess the data with a scaler from
 |        sklearn.preprocessing.
 |  
 |      All last four solvers support both dense and sparse data. However,
 |      only 'sag' supports sparse input when `fit_intercept` is True.
 |  
 |      .. versionadded:: 0.17
 |         Stochastic Average Gradient descent solver.
 |  
 |  tol : float
 |      Precision of the solution.
 |  
 |  random_state : int seed, RandomState instance, or None (default)
 |      The seed of the pseudo random number generator to use when
 |      shuffling the data. Used in 'sag' solver.
 |  
 |      .. versionadded:: 0.17
 |         *random_state* to support Stochastic Average Gradient.
 |  
 |  Attributes
 |  ----------
 |  coef_ : array, shape (n_features,) or (n_targets, n_features)
 |      Weight vector(s).
 |  
 |  intercept_ : float | array, shape = (n_targets,)
 |      Independent term in decision function. Set to 0.0 if
 |      ``fit_intercept = False``.
 |  
 |  n_iter_ : array or None, shape (n_targets,)
 |      Actual number of iterations for each target. Available only for
 |      sag and lsqr solvers. Other solvers will return None.
 |  
 |  See also
 |  --------
 |  RidgeClassifier, RidgeCV, KernelRidge
 |  
 |  Examples
 |  --------
 |  >>> from sklearn.linear_model import Ridge
 |  >>> import numpy as np
 |  >>> n_samples, n_features = 10, 5
 |  >>> np.random.seed(0)
 |  >>> y = np.random.randn(n_samples)
 |  >>> X = np.random.randn(n_samples, n_features)
 |  >>> clf = Ridge(alpha=1.0)
 |  >>> clf.fit(X, y) # doctest: +NORMALIZE_WHITESPACE
 |  Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
 |        normalize=False, random_state=None, solver='auto', tol=0.001)
 |  
 |  Method resolution order:
 |      Ridge
 |      _BaseRidge
 |      abc.NewBase
 |      sklearn.linear_model.base.LinearModel
 |      abc.NewBase
 |      sklearn.base.BaseEstimator
 |      sklearn.base.RegressorMixin
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, alpha=1.0, fit_intercept=True, normalize=False, copy_X=True, max_iter=None, tol=0.001, solver='auto', random_state=None)
 |  
 |  fit(self, X, y, sample_weight=None)
 |      Fit Ridge regression model
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix}, shape = [n_samples, n_features]
 |          Training data
 |      
 |      y : array-like, shape = [n_samples] or [n_samples, n_targets]
 |          Target values
 |      
 |      sample_weight : float or numpy array of shape [n_samples]
 |          Individual weights for each sample
 |      
 |      Returns
 |      -------
 |      self : returns an instance of self.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset([])
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.linear_model.base.LinearModel:
 |  
 |  decision_function(*args, **kwargs)
 |      DEPRECATED: ``decision_function`` will be removed in 0.19.
 |      
 |      Decision function of the linear model.
 |      
 |              Parameters
 |              ----------
 |              X : {array-like, sparse matrix}, shape = (n_samples, n_features)
 |                  Samples.
 |      
 |              Returns
 |              -------
 |              C : array, shape = (n_samples,)
 |                  Returns predicted values.
 |  
 |  predict(self, X)
 |      Predict using the linear model
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix}, shape = (n_samples, n_features)
 |          Samples.
 |      
 |      Returns
 |      -------
 |      C : array, shape = (n_samples,)
 |          Returns predicted values.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.BaseEstimator:
 |  
 |  __repr__(self)
 |  
 |  get_params(self, deep=True)
 |      Get parameters for this estimator.
 |      
 |      Parameters
 |      ----------
 |      deep : boolean, optional
 |          If True, will return the parameters for this estimator and
 |          contained subobjects that are estimators.
 |      
 |      Returns
 |      -------
 |      params : mapping of string to any
 |          Parameter names mapped to their values.
 |  
 |  set_params(self, **params)
 |      Set the parameters of this estimator.
 |      
 |      The method works on simple estimators as well as on nested objects
 |      (such as pipelines). The latter have parameters of the form
 |      ``<component>__<parameter>`` so that it's possible to update each
 |      component of a nested object.
 |      
 |      Returns
 |      -------
 |      self
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.base.BaseEstimator:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.RegressorMixin:
 |  
 |  score(self, X, y, sample_weight=None)
 |      Returns the coefficient of determination R^2 of the prediction.
 |      
 |      The coefficient R^2 is defined as (1 - u/v), where u is the residual
 |      sum of squares ((y_true - y_pred) ** 2).sum() and v is the total
 |      sum of squares ((y_true - y_true.mean()) ** 2).sum().
 |      Best possible score is 1.0 and it can be negative (because the
 |      model can be arbitrarily worse). A constant model that always
 |      predicts the expected value of y, disregarding the input features,
 |      would get a R^2 score of 0.0.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape = (n_samples, n_features)
 |          Test samples.
 |      
 |      y : array-like, shape = (n_samples) or (n_samples, n_outputs)
 |          True values for X.
 |      
 |      sample_weight : array-like, shape = [n_samples], optional
 |          Sample weights.
 |      
 |      Returns
 |      -------
 |      score : float
 |          R^2 of self.predict(X) wrt. y.
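
As a sanity check on the definition above, R^2 can be computed by hand and compared against the score method. A minimal sketch, assuming a fitted model and numpy arrays X and y (these names are illustrative, not from the pipeline below):

y_pred = model.predict(X)
u = ((y - y_pred) ** 2).sum()   # residual sum of squares
v = ((y - y.mean()) ** 2).sum() # total sum of squares
r2 = 1 - u / v                  # should agree with model.score(X, y)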

Our MSE function is the same as before, but this time, for convenience, it takes an (already trained) model as an input parameter.

In [78]:
def MSE(model, X, y):
    predictions = model.predict(X)
    differences = [(a-b)**2 for (a,b) in zip(predictions, y)]
    return sum(differences) / len(differences)
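
Since model.predict returns a numpy array, the same quantity can be computed in vectorized form. A minimal equivalent sketch (MSE_vec is a hypothetical name, not used elsewhere in this pipeline):

def MSE_vec(model, X, y):
    residuals = model.predict(X) - numpy.array(y) # per-sample errors
    return (residuals ** 2).mean()                # mean squared error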

Finally, to implement the pipeline, we (1) iterate through various values of lambda; (2) fit a ridge regression model for each value; (3) evaluate that model's performance on the validation set; and (4) keep track of the best model we've seen so far, as measured on the validation set.

In [79]:
bestModel = None # (1) best model seen so far (on the validation set)
bestMSE = None   # and its validation MSE
In [80]:
for lamb in [0.01, 0.1, 1, 10, 100]:
    # (2) fit a ridge regression model with regularization strength lamb
    model = linear_model.Ridge(alpha=lamb, fit_intercept=False)
    model.fit(X_train, y_train)

    # (3) evaluate on the training and validation sets
    mseTrain = MSE(model, X_train, y_train)
    mseValid = MSE(model, X_valid, y_valid)

    print("lambda = " + str(lamb) + ", training/validation error = " +
          str(mseTrain) + '/' + str(mseValid))

    # (4) keep the model with the lowest validation error seen so far
    if bestModel is None or mseValid < bestMSE:
        bestModel = model
        bestMSE = mseValid
lambda = 0.01, training/validation error = 0.425736288581991/0.44558206040397835
lambda = 0.1, training/validation error = 0.42573629778039757/0.44557008375784823
lambda = 1, training/validation error = 0.42573720762523615/0.44545205378945923
lambda = 10, training/validation error = 0.42581916084972155/0.4444289487678403
lambda = 100, training/validation error = 0.4296874130907892/0.4417827155266739
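
Since the loop keeps the fitted model object itself, we can read the winning regularization strength back from it. A minimal sketch (Ridge stores its constructor parameters as attributes):

print(bestModel.alpha) # regularization strength of the best model; here 100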

Finally, we evaluate the performance of our best model on the test set; here the best model was the one with lambda = 100, which achieved the lowest validation error. Note that this is the only time throughout the entire pipeline that we examine the test data.

In [81]:
mseTest = MSE(bestModel, X_test, y_test)
print("test error = " + str(mseTest))
test error = 0.4400519282444615
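
As an aside, scikit-learn can automate this kind of search: RidgeCV fits ridge models over a grid of regularization strengths and selects among them by cross-validation on the training data (rather than using a fixed validation split, as we did above). A minimal sketch under those assumptions, reusing our training matrices:

from sklearn.linear_model import RidgeCV

cvModel = RidgeCV(alphas=[0.01, 0.1, 1, 10, 100], fit_intercept=False)
cvModel.fit(X_train, y_train)
print(cvModel.alpha_) # regularization strength selected by cross-validation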