Data is available at Download and save to your own directory.

Regression (lines of best fit)

Read in a small dataset of reviews of fantasy novels (from Goodreads)

What does the data look like? (json format)

To predict ratings from review length, let's start by assembling vectors of both

And generating a quick scatter plot of the data to see overall trends...

To perform regression, convert to features (X) and labels (y)
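A minimal sketch of this step. The field names `review_text` and `rating` and the three toy records are assumptions for illustration, not the actual dataset:

```python
import numpy as np

# Toy records standing in for the parsed JSON reviews (hypothetical values)
data = [
    {"review_text": "Loved it", "rating": 5},
    {"review_text": "Too long and slow in the middle", "rating": 3},
    {"review_text": "Fine", "rating": 4},
]

ratings = [d["rating"] for d in data]
lengths = [len(d["review_text"]) for d in data]

# Each feature vector is [1, length]; the constant 1 gives the model an offset term
X = np.array([[1, l] for l in lengths], dtype=float)
y = np.array(ratings, dtype=float)
```

Including the constant feature in `X` means the offset is just another entry of theta, which keeps the later matrix expressions uniform.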

Fit a linear regression model to the data (sklearn)

Extract the model coefficients (theta)
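One way the fit and coefficient extraction might look with sklearn, sketched on synthetic data that lies exactly on a known line (the data and variable names are assumptions for illustration):

```python
import numpy as np
from sklearn import linear_model

# Synthetic data lying exactly on y = 2 + 0.5 * x (illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])  # constant feature plus x
y = 2.0 + 0.5 * x

# fit_intercept=False because the constant column already acts as the offset
model = linear_model.LinearRegression(fit_intercept=False)
model.fit(X, y)

theta = model.coef_  # [theta_0, theta_1]
```

With `fit_intercept=False`, both the offset and the slope appear in `model.coef_`, in the same order as the columns of `X`.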

Same thing using numpy (lstsq)...

Same thing by computing the pseudoinverse directly
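The two numpy routes can be compared side by side. This is a sketch on synthetic data (the generating line and the noise level are assumptions), not the notebook's original cells:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=50)  # noisy line, illustrative

# Least-squares solver
theta_lstsq, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)

# Pseudoinverse written out explicitly: (X^T X)^{-1} X^T y
theta_pinv = np.linalg.inv(X.T @ X) @ X.T @ y
```

Both give the same theta up to floating-point error. The explicit inverse is shown only because it mirrors the formula; `lstsq` is generally the more numerically robust choice.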

Plot the line of best fit...

Fit a slightly more complex model with two features (review length and number of comments)

Fit the model and extract the coefficients

Mean Squared Error and R^2

Extract model predictions

Sum of squared errors (SSE)

Mean squared error (MSE)

R^2 and fraction of variance unexplained (FVU)
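These quantities can be computed by hand from labels and predictions. A small sketch with made-up numbers (the arrays are assumptions, chosen so the arithmetic is easy to follow):

```python
import numpy as np

y = np.array([3.0, 4.0, 5.0, 4.0, 2.0])       # hypothetical labels
y_pred = np.array([3.5, 3.5, 4.5, 4.0, 2.5])  # hypothetical predictions

sse = np.sum((y - y_pred) ** 2)  # sum of squared errors
mse = sse / len(y)               # mean squared error
fvu = mse / np.var(y)            # fraction of variance unexplained
r2 = 1 - fvu                     # R^2
```

Here `np.var` is the population variance (dividing by N), which matches the MSE's normalization, so the ratio compares the model against always predicting the mean.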

Can also get the R^2 coefficient directly from the library...

Simple feature transformations

Polynomial (quadratic) function
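A quadratic model is still linear in theta, so it only requires an extra feature column. A sketch on synthetic data (the coefficients below are assumptions for illustration):

```python
import numpy as np

x = np.linspace(-2, 2, 9)
y = 1.0 - 3.0 * x + 0.5 * x**2  # exact quadratic, illustrative only

# Features are [1, x, x^2]; the fitting procedure itself is unchanged
X = np.column_stack([np.ones_like(x), x, x**2])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

A cubic model follows the same pattern with one more column, `x**3`. With raw review lengths, rescaling the feature first may help keep the higher powers numerically manageable.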

Compute the R^2 coefficient (using the library)

Cubic function

Use a (slightly larger) dataset of fantasy reviews

Extract averages for each day
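One way to group ratings by day and average them, assuming each review has already been reduced to a `(weekday, rating)` pair (the pairs and the 0 = Monday convention are hypothetical):

```python
from collections import defaultdict

# (weekday, rating) pairs, with 0 = Monday ... 6 = Sunday (hypothetical values)
records = [(0, 4), (0, 5), (1, 3), (6, 5), (6, 4), (6, 3)]

ratings_per_day = defaultdict(list)
for day, rating in records:
    ratings_per_day[day].append(rating)

# Average rating for each day that has at least one review
averages = {day: sum(r) / len(r) for day, r in ratings_per_day.items()}
```

Days with no reviews simply do not appear in `averages`, which avoids a division by zero.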

Plot the averages for each day

Binary (one-hot) features

Read a small dataset of beer reviews with gender attributes

What does the data look like?

Filter the dataset to include only those users who specified gender

How many users have specified gender?

Binary representation of gender attribute

Fit a model to the binary feature
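A sketch of the encoding and fit on toy data. The field names, values, and the choice to encode "Female" as 1 are assumptions for illustration:

```python
import numpy as np

# Hypothetical users with a gender attribute and a review length
users = [
    {"gender": "Male", "review_length": 100},
    {"gender": "Male", "review_length": 110},
    {"gender": "Male", "review_length": 90},
    {"gender": "Female", "review_length": 120},
    {"gender": "Female", "review_length": 130},
]

# Feature vector [1, is_female]: the binary feature is 1 for "Female", 0 otherwise
X = np.array([[1.0, 1.0 if u["gender"] == "Female" else 0.0] for u in users])
y = np.array([u["review_length"] for u in users], dtype=float)

theta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With a single binary feature, `theta[0]` is the mean of the group encoded as 0 and `theta[1]` is the difference between the two group means.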

The model indicates that females tend to leave slightly longer reviews than males. Plot data from the two groups.

Transformation of output variables

Dataset of reddit submissions (and resubmissions)

Compute popularity as a function of submission number (i.e., how many times has this identical image been submitted)

Extract a single feature, which is just the submission number (0 for the first submission, 1 for the first resubmission, etc.)

The label to predict is the number of upvotes as a function of submission number

Fit two models: one with the original output variable, one with a log-transformed output variable
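A sketch of fitting both models on synthetic data with exponential decay. The data and the use of the natural log are assumptions; another base would only rescale the coefficients:

```python
import numpy as np

submission_number = np.arange(6, dtype=float)
upvotes = 1000.0 * np.exp(-0.5 * submission_number)  # synthetic, illustrative

X = np.column_stack([np.ones_like(submission_number), submission_number])

# Model 1: predict upvotes directly
theta_raw, *_ = np.linalg.lstsq(X, upvotes, rcond=None)
# Model 2: predict log(upvotes)
theta_log, *_ = np.linalg.lstsq(X, np.log(upvotes), rcond=None)

pred_raw = X @ theta_raw
pred_log = np.exp(X @ theta_log)  # map log-space predictions back to upvotes
```

On data like this, the log-transformed model recovers the decay exactly, while the raw linear model cannot follow the curve and can even predict negative upvotes for later resubmissions.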

Plot data and fitted values for the two models



Simple regression question, same form as the examples above

Extract theta



Fit a model with an additional variable

Extract theta

Compute the MSE

(explanation: the coefficient of length becomes smaller, as the variability in ratings is already largely explained by the number of comments)


Sketch proof:

$\frac{\partial}{\partial \theta_0} \sum_i (\theta_0 - y_i)^2 = 2\sum_i (\theta_0 - y_i)$

equals zero when $\sum_i \theta_0 = \sum_i y_i$, i.e., $\theta_0 = \frac{1}{N} \sum_i y_i$
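This result can be checked numerically with a simple grid search over constant predictors (the labels and grid are arbitrary choices for illustration):

```python
import numpy as np

y = np.array([1.0, 2.0, 4.0, 9.0])       # arbitrary labels, mean is 4.0
candidates = np.linspace(0, 10, 1001)    # candidate values of theta_0

# Mean squared error of each constant predictor
mse = [np.mean((c - y) ** 2) for c in candidates]
best = candidates[np.argmin(mse)]
```

The grid point with the lowest MSE coincides with `y.mean()`.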



$\frac{\partial}{\partial \theta_0} \sum_i | \theta_0 - y_i | = \sum_i \delta(y_i < \theta_0) - \sum_i \delta(y_i > \theta_0)$

zero when $\theta_0$ is the median value of $y$ (meaning that the two terms balance out)
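A numerical check of the same kind for the absolute error, again using a grid search (the labels are arbitrary, with an odd count so that the median is unique):

```python
import numpy as np

y = np.array([1.0, 2.0, 4.0, 9.0, 20.0])  # arbitrary labels, median is 4.0
candidates = np.linspace(0, 10, 1001)     # candidate values of theta_0

# Mean absolute error of each constant predictor
mae = [np.mean(np.abs(c - y)) for c in candidates]
best = candidates[np.argmin(mae)]
```

The minimizer lands on the median (4.0) rather than the mean (7.2), showing that the outlier at 20 has much less influence under the absolute error.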



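$\arg\max_\theta \prod_i \frac{1}{2b}\exp(-\frac{|x_i\cdot \theta - y_i|}{b})$

$= \arg\max_\theta \sum_i \left( -\log(2b) - \frac{|x_i \cdot \theta - y_i|}{b} \right)$ (taking logs, which preserves the maximizer)

$= \arg\max_\theta \sum_i -| x_i \cdot \theta - y_i |$ (dropping the constant term and the positive factor $\frac{1}{b}$)

$= \arg\min_\theta \sum_i | x_i \cdot \theta - y_i|$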



rewrite $\sum_i (x_i \cdot \theta - y_i)^2$ as $(X\theta - y)^T(X\theta - y)$

$= (\theta^TX^T - y^T)(X\theta - y)$

$= \theta^TX^TX\theta -2y^TX\theta +y^Ty$ (all terms are scalar)

$\frac{\partial}{\partial \theta} (\theta^TX^TX\theta -2y^TX\theta +y^Ty) = 2(\theta^T X^TX - y^TX)$

$= 0$ when $\theta^TX^TX = y^TX$; or when $X^T X \theta = X^Ty$

i.e., when $\theta = (X^TX)^{-1}X^Ty$


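Similar to 2.3: when solving for $\theta$, the derivative with respect to the offset term is $\frac{\partial}{\partial \theta_0} \sum_i (x_i \cdot \theta - y_i)^2 = 2\sum_i (x_i \cdot \theta - y_i)$ (since the feature multiplying $\theta_0$ is the constant 1), so at the minimum $\sum_i (x_i \cdot \theta - y_i) = 0$, i.e., the residuals sum to zero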
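A quick numerical illustration, assuming the model includes a constant feature (the synthetic data and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=40)
X = np.column_stack([np.ones_like(x), x])        # first column is the constant 1
y = 3.0 - 1.5 * x + rng.normal(0, 1.0, size=40)  # noisy line, illustrative

# Solve the normal equations X^T X theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)

residuals = X @ theta - y
gradient = 2 * X.T @ residuals  # gradient of the sum of squared errors at theta
```

The first entry of `gradient` is twice the sum of the residuals, so a vanishing gradient implies that the residuals sum to zero even though the individual residuals are not small.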