Data is available at the download link; save a copy to your own directory

Bag-of-words models

How many unique words are there?

Ignore capitalization and remove punctuation
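A minimal sketch of this counting step, using a tiny placeholder corpus in place of the real review data:

```python
import string
from collections import defaultdict

# Placeholder reviews standing in for the real dataset
reviews = ["This book was GREAT!", "Not great, not terrible.", "A great read."]

word_count = defaultdict(int)
punct = set(string.punctuation)
for review in reviews:
    # Lowercase and drop punctuation characters before splitting into words
    text = ''.join(c for c in review.lower() if c not in punct)
    for w in text.split():
        word_count[w] += 1

n_unique = len(word_count)  # number of unique words
```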

With stemming
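One way to add stemming, assuming NLTK's `PorterStemmer` (the original may use a different stemmer); the reviews here are placeholder data:

```python
import string
from nltk.stem.porter import PorterStemmer

reviews = ["She was running and drinking", "He runs; she drank."]  # placeholder data

stemmer = PorterStemmer()
punct = set(string.punctuation)

stems = set()
for review in reviews:
    # Same lowercase / punctuation-removal preprocessing, then stem each token
    text = ''.join(c for c in review.lower() if c not in punct)
    for w in text.split():
        stems.add(stemmer.stem(w))
```

With stemming, "running" and "runs" collapse to the same stem, shrinking the vocabulary.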

Just build our feature vector by taking the most popular words (lowercase, punctuation removed, but no stemming)
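A sketch of building the feature vector from the most popular words; `N` would be e.g. 1000 on the real dataset, and the reviews below are placeholders:

```python
import string
from collections import Counter

reviews = ["Great book, great plot!", "Terrible book.", "Plot was fine."]  # placeholder data

punct = set(string.punctuation)
def tokenize(s):
    return ''.join(c for c in s.lower() if c not in punct).split()

counts = Counter(w for r in reviews for w in tokenize(r))
N = 4  # use e.g. 1000 on the full dataset
words = [w for w, _ in counts.most_common(N)]
wordId = {w: i for i, w in enumerate(words)}

def feature(review):
    # Counts of the N most popular words, plus a constant offset term
    feat = [0] * len(words) + [1]
    for w in tokenize(review):
        if w in wordId:
            feat[wordId[w]] += 1
    return feat

X = [feature(r) for r in reviews]
```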

Sentiment analysis

Extract bag-of-words features. For a bigger dataset, replace this with a sparse matrix to save memory (see examples in Chapter 6)
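A condensed sketch of the whole step on toy data; a regularized linear model (sklearn's `Ridge`) stands in for whichever regressor the chapter uses, and the reviews/ratings are placeholders:

```python
import string
from collections import Counter
from sklearn.linear_model import Ridge

# Placeholder review texts and star ratings
reviews = ["great great plot", "terrible book", "fine plot", "great read"]
ratings = [5.0, 1.0, 3.0, 5.0]

punct = set(string.punctuation)
def tokenize(s):
    return ''.join(c for c in s.lower() if c not in punct).split()

counts = Counter(w for r in reviews for w in tokenize(r))
words = [w for w, _ in counts.most_common(1000)]
wordId = {w: i for i, w in enumerate(words)}

def feature(review):
    feat = [0] * len(words)
    for w in tokenize(review):
        if w in wordId:
            feat[wordId[w]] += 1
    return feat + [1]  # constant offset term

X = [feature(r) for r in reviews]
model = Ridge(1.0, fit_intercept=False)  # regularized regression on word counts
model.fit(X, ratings)
predictions = model.predict(X)
```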


Simple example...

Extract n-grams up to length 5 (same dataset as example above)
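The n-gram extraction can be sketched as a small helper that collects all n-grams of length 1 through 5, joined with spaces (the sentence below is placeholder text):

```python
def ngrams(words, max_n=5):
    # All n-grams of length 1..max_n, each joined into a single space-separated string
    out = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            out.append(' '.join(words[i:i + n]))
    return out

tokens = "a page turner from start to finish".split()
grams = ngrams(tokens)
```

For 7 tokens this yields 7 + 6 + 5 + 4 + 3 = 25 n-grams, mixing lengths 1 through 5.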

A few of our 1000 most popular n-grams. Note the combination of n-grams of different lengths

Same as the model in the previous example above, except using n-grams rather than just unigrams

Some of the most negative and positive n-grams...


Small set of Goodreads fantasy reviews

For example...

Similar process to extract bag-of-words representations as in previous examples

Document frequency (df)

Term frequency (tf)

Here we extract frequencies for terms in a single specific review

Find the highest tf-idf words in our example review
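The df / tf / tf-idf computations above can be sketched as follows, using a placeholder corpus; log base 10 is one common choice for the idf term:

```python
import math
from collections import defaultdict

# Placeholder corpus of tokenized reviews
corpus = [
    "the dragon slept".split(),
    "the knight fought the dragon".split(),
    "a quiet village".split(),
]

# Document frequency: in how many documents does each term appear?
df = defaultdict(int)
for doc in corpus:
    for w in set(doc):
        df[w] += 1

# Term frequency for one specific review
review = corpus[1]
tf = defaultdict(int)
for w in review:
    tf[w] += 1

# tf-idf = tf * log(N / df); common words like "the" get low scores
tfidf = {w: tf[w] * math.log10(len(corpus) / df[w]) for w in tf}
best = max(tfidf, key=tfidf.get)  # highest tf-idf word in this review
```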

Cosine similarity

Find the other reviews in the corpus with the highest cosine similarity between tf-idf vectors
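A sketch of cosine similarity over tf-idf vectors stored as sparse dictionaries (the vectors below are placeholders):

```python
import math

def cosine(x, y):
    # Cosine similarity between two sparse vectors stored as {term: weight} dicts
    num = sum(x[w] * y[w] for w in x if w in y)
    denom = math.sqrt(sum(v * v for v in x.values())) * math.sqrt(sum(v * v for v in y.values()))
    return num / denom if denom > 0 else 0.0

query = {'dragon': 0.5, 'knight': 1.2}                      # tf-idf vector of our review
others = [{'dragon': 0.5, 'village': 0.9}, {'quiet': 1.0}]  # rest of the corpus

# Rank the other reviews by similarity to the query
sims = sorted(((cosine(query, d), i) for i, d in enumerate(others)), reverse=True)
```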

word2vec (gensim)

A few utility data structures (used later)

Tokenize the reviews, so that each review becomes a list of words

Example of a tokenized review

Fit the word2vec model

Extract word representation for a particular word

Find similar words to a given query


Almost the same as word2vec, but "documents" are made up of sequences of item IDs rather than words

t-SNE embedding

Visualize the embeddings from the model above using t-SNE

Fit a model with just two components for the sake of visualization

Generate scatterplots using the embedded points (one scatter plot per category)

Plot data from a few categories (more interesting with a larger dataset)
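The visualization steps can be sketched as follows; random vectors stand in for the item2vec embeddings, and the categories are placeholders:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder "item embeddings" with a category per item;
# in the real example these come from the fitted item2vec model
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))
categories = ['fantasy', 'mystery', 'romance'] * 10

# Two components, purely for the sake of a 2-d scatterplot
emb = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)

# One scatterplot per category
for cat in set(categories):
    pts = np.array([emb[i] for i, c in enumerate(categories) if c == cat])
    plt.scatter(pts[:, 0], pts[:, 1], label=cat)
plt.legend()
```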


Exercises 8.1 / 8.2

Simple sentiment analysis pipeline

Add a couple of "options" for the representation (in this case, whether to convert to lowercase and whether to remove punctuation). More could be added.

Condense the pipeline code (see Chapter 3) into a single function

Run models with different feature representation options and dictionary sizes
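One way to condense this into a single function with representation options, using Ridge regression as in the earlier sentiment example (placeholder data; the function name and options are illustrative):

```python
import string
from collections import Counter
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Placeholder reviews and ratings standing in for the real corpus
reviews = ["Great plot!", "terrible BOOK", "a fine read", "GREAT great book"]
ratings = [5.0, 1.0, 3.0, 5.0]

def pipeline(dSize, lower=True, removePunct=True):
    # Tokenizer controlled by the representation options
    punct = set(string.punctuation)
    def tok(s):
        if lower: s = s.lower()
        if removePunct: s = ''.join(c for c in s if c not in punct)
        return s.split()
    # Dictionary of the dSize most popular words under these options
    counts = Counter(w for r in reviews for w in tok(r))
    words = [w for w, _ in counts.most_common(dSize)]
    wordId = {w: i for i, w in enumerate(words)}
    X = []
    for r in reviews:
        feat = [0] * len(words) + [1]  # counts plus offset
        for w in tok(r):
            if w in wordId: feat[wordId[w]] += 1
        X.append(feat)
    model = Ridge(1.0, fit_intercept=False)
    model.fit(X, ratings)
    return mean_squared_error(ratings, model.predict(X))

# Compare feature options and dictionary sizes
results = {(dSize, lower): pipeline(dSize, lower=lower)
           for dSize in [2, 5] for lower in [True, False]}
```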


Use item2vec to make recommendations (following code from exercises in Chapter 4)


(see tf-idf retrieval examples above)


Predict the rating using item2vec item similarity scores. Adapts models from Chapter 4.