Data is available at Download and save to your own directory

Basic fairness measurements - interaction and recommendation distributions

Goodreads graphic novel data. Train a standard recommender, and compare generated recommendations to historical interactions, in terms of frequency distributions.

Problem setup follows the BPR model from the "implicit" library (see Chapter 5), but could be completed with any recommender

Fit the model

Extract recommendations (example)

Next we extract recommendations for all users, and compare these to their interaction histories. First, collect each user's history from the interaction data:

Next build similar data structures containing recommendations for each user

For each user, generate a set of recommendations equivalent in size to their number of interactions used for training
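As a rough sketch of this step, the snippet below ranks items by the inner product of user and item factors and keeps as many recommendations as the user has training interactions. The function name, the toy factors, and the choice to filter out already-seen items are assumptions for illustration; the fitted model's own recommendation call would normally be used instead.

```python
def recommend_like_history(user_vec, item_vecs, history, exclude_seen=True):
    """Return len(history) item indices, ranked by inner-product score."""
    seen = set(history)
    scored = []
    for item, vec in enumerate(item_vecs):
        if exclude_seen and item in seen:
            continue  # assumption: do not re-recommend training items
        score = sum(u * v for u, v in zip(user_vec, vec))
        scored.append((score, item))
    scored.sort(reverse=True)  # highest score first
    return [item for _, item in scored[:len(history)]]

# Toy example: 2-dimensional factors for 5 items (hypothetical values).
item_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [0.8, 0.3]]
user_vec = [1.0, 0.2]
history = [0, 2]  # the user interacted with items 0 and 2 during training
recs = recommend_like_history(user_vec, item_vecs, history)
```

Matching the list length to the history length keeps the total number of recommendations equal to the total number of interactions, so the two frequency distributions can be compared directly.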

So far our data structures just contain lists of historical interactions and recommendations for each user. Convert these into counts for each item. This is done both for interactions (I) and recommendations (R).
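A minimal sketch of this conversion, assuming the per-user data is held in dictionaries mapping a user ID to a list of item IDs (the names `interactions`, `recommendations`, `countsI` and `countsR` are illustrative, not taken from the original code):

```python
from collections import Counter

def item_counts(per_user_items):
    """Count how often each item appears across all users' lists."""
    counts = Counter()
    for items in per_user_items.values():
        counts.update(items)
    return counts

# Toy data: user -> list of items (hypothetical).
interactions = {"u1": ["a", "b"], "u2": ["a", "c"], "u3": ["a"]}
recommendations = {"u1": ["a", "c"], "u2": ["a", "b"], "u3": ["a"]}

countsI = item_counts(interactions)      # interactions (I)
countsR = item_counts(recommendations)   # recommendations (R)

# Frequencies sorted from most to least popular, ready for plotting.
sortedI = sorted(countsI.values(), reverse=True)
sortedR = sorted(countsR.values(), reverse=True)
```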

Sort counts by popularity to generate plots

Collect the information for interactions and recommendations for plotting

Gini coefficient

The two implementations below compute the Gini coefficient either by comparing all pairs, or by doing so for a given number of sampled pairs
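For reference, here is one way the two variants might look. This is a sketch under the usual definition (mean absolute difference between pairs, divided by twice the mean); the function names, the sample count and the fixed seed are assumptions, and the input is assumed to be a list of item counts with a positive total.

```python
import random

def gini_exact(values):
    """Gini coefficient from all ordered pairs: sum|xi - xj| / (2 * n * sum(x))."""
    n = len(values)
    total = sum(values)
    diff = sum(abs(a - b) for a in values for b in values)
    return diff / (2 * n * total)

def gini_sampled(values, samples=10000, seed=0):
    """Estimate the same quantity from randomly sampled pairs."""
    rng = random.Random(seed)
    mean = sum(values) / len(values)
    diff = 0.0
    for _ in range(samples):
        a, b = rng.choice(values), rng.choice(values)
        diff += abs(a - b)
    # Average absolute difference over sampled pairs, divided by 2 * mean.
    return diff / (2 * samples * mean)

uniform = [1, 1, 1, 1]        # every item equally popular
concentrated = [0, 0, 0, 4]   # one item receives everything
```

The exact version is quadratic in the number of items, which is why a sampled estimate is useful for large catalogs.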

Compute the Gini coefficients of interactions versus recommendations, using the two distributions computed in the experiments above

Average cosine similarity among interactions versus recommendations

Given a set of items, measure the average cosine similarity between them (by taking a sample of pairs, similar to our implementation of the Gini coefficient). This can be used as a rough measure of the diversity of a set of recommendations: the lower the average similarity, the more diverse the set.
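The measurement can be sketched as follows, assuming each item is represented by a vector (for example, the item factors of the fitted model). The names `cosine`, `avg_cosine` and `item_vecs`, the sample count and the seed are illustrative assumptions, and the item list must contain at least two items.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def avg_cosine(items, item_vecs, samples=1000, seed=0):
    """Average cosine similarity over randomly sampled pairs of distinct items."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        i, j = rng.sample(items, 2)  # two different positions in the list
        total += cosine(item_vecs[i], item_vecs[j])
    return total / samples

# Toy item vectors (hypothetical).
item_vecs = {"a": [1.0, 0.0], "b": [2.0, 0.0], "c": [0.0, 1.0]}
similar_set = avg_cosine(["a", "b"], item_vecs)   # parallel vectors
diverse_set = avg_cosine(["a", "c"], item_vecs)   # orthogonal vectors
```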

Compute the average cosine similarity among interactions from a particular user

Compute the same quantity across a large sample of recommendations and interactions for several users, as an aggregate measure of recommendation versus interaction diversity

A lower average cosine similarity among interactions indicates that they are more diverse compared to recommendations

Fair recommendation

First, implement a latent factor model. Our basic implementation follows the TensorFlow latent factor model from Chapter 5.

Data structures to organize interactions by gender

Latent factor model. The "absoluteFairness" function is new; others are equivalent to our model from Chapter 5.
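The fairness term itself is not shown in this text. As an illustration only, the sketch below computes one commonly used notion of absolute unfairness: for each item, compare the absolute gap between average prediction and average label in one group with the same gap in the other group, then average over items. Whether the model's "absoluteFairness" function uses exactly this form is an assumption, and the plain-Python version here is only for reading, since a differentiable TensorFlow version would be needed inside the training objective.

```python
def absolute_unfairness(preds_g, labels_g, preds_not_g, labels_not_g):
    """Each argument maps item -> list of values from users in that group.

    Assumed definition: mean over items of
    | |mean pred - mean label| in group g  -  the same gap in the other group |.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    # Only items rated by both groups can be compared.
    items = [i for i in preds_g if i in preds_not_g]
    total = 0.0
    for i in items:
        gap_g = abs(mean(preds_g[i]) - mean(labels_g[i]))
        gap_not_g = abs(mean(preds_not_g[i]) - mean(labels_not_g[i]))
        total += abs(gap_g - gap_not_g)
    return total / len(items)

# Toy example (hypothetical ratings): item "x" is mispredicted for one group only.
preds_g = {"x": [4.0, 4.0], "y": [3.5]}
labels_g = {"x": [5.0, 5.0], "y": [4.0]}
preds_not_g = {"x": [3.0], "y": [2.5]}
labels_not_g = {"x": [3.0], "y": [3.0]}
unfairness = absolute_unfairness(preds_g, labels_g, preds_not_g, labels_not_g)
```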

Same model, but not including the fairness terms

Maximal marginal relevance (MMR)

Similarity between items (in this case cosine)

Find the most similar item among a candidate set

Select a random user to receive recommendations

Define a function to get the next recommendation given an initial list, i.e., the maximal marginally relevant item. Lambda (lamb) controls the tradeoff between compatibility and diversity.
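A minimal sketch of such a function, using the standard MMR criterion (lamb times relevance, minus (1 - lamb) times the highest similarity to anything already selected). The names `next_mmr`, `mmr_list`, `relevance` and the toy vectors are assumptions for illustration, not the original implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def next_mmr(candidates, selected, relevance, sim, lamb):
    """Return the candidate maximizing lamb*rel - (1-lamb)*max similarity to selected."""
    best, best_score = None, None
    for i in candidates:
        if i in selected:
            continue
        max_sim = max((sim(i, j) for j in selected), default=0.0)
        score = lamb * relevance[i] - (1 - lamb) * max_sim
        if best_score is None or score > best_score:
            best, best_score = i, score
    return best

def mmr_list(candidates, relevance, sim, lamb, k):
    """Build a list of up to k items by repeatedly taking the next MMR item."""
    selected = []
    for _ in range(k):
        nxt = next_mmr(candidates, selected, relevance, sim, lamb)
        if nxt is None:
            break
        selected.append(nxt)
    return selected

# Toy data (hypothetical): "a" and "b" are near-duplicates, "c" is different.
vecs = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
relevance = {"a": 1.0, "b": 0.9, "c": 0.5}
sim = lambda i, j: cosine(vecs[i], vecs[j])

only_relevance = mmr_list(list(vecs), relevance, sim, lamb=1.0, k=2)
with_diversity = mmr_list(list(vecs), relevance, sim, lamb=0.5, k=2)
```

With lamb=1 the list is simply the two most relevant items; with lamb=0.5 the near-duplicate "b" is displaced by the less relevant but dissimilar "c".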

Before re-ranking, generate a list of compatibility scores, i.e., a ranked list of items for a particular user

Generate a list of recommendations for the user by repeatedly retrieving the maximal marginally relevant recommendation. First, just get the most relevant items (lambda=1) without encouraging diversity.

Note that this implementation is not particularly optimized, and takes several seconds to generate a list of recommendations.

More recommendations for different relevance/diversity tradeoffs. Note that the tradeoff parameter is quite sensitive to the specific scale of the model parameters.



First just try out a different similarity function (based on the inner product)

Experiment with different relevance/diversity tradeoffs

And plot the results


Concentration effects

Measure concentration in terms of the Gini coefficient

Apply a small penalty for items that are highly recommended (resulting in a reduction of concentration). This is essentially just a simple re-ranking strategy.
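One way such a re-ranking might be sketched is shown below: users are processed in turn, and each item's score is reduced in proportion to how often it has already been recommended. The linear form of the penalty, the function name and the toy scores are assumptions; the original penalty may be defined differently. The sketch assumes k is no larger than the number of items.

```python
from collections import Counter

def rerank_with_popularity_penalty(user_scores, k, penalty):
    """Greedy re-ranking: score minus penalty * (times recommended so far)."""
    rec_counts = Counter()
    recs = {}
    for user, scores in user_scores.items():
        chosen = []
        for _ in range(k):
            best = max(
                (i for i in scores if i not in chosen),
                key=lambda i: scores[i] - penalty * rec_counts[i],
            )
            chosen.append(best)
            rec_counts[best] += 1
        recs[user] = chosen
    return recs, rec_counts

# Toy example (hypothetical): three users with identical preferences.
scores = {"a": 1.0, "b": 0.9, "c": 0.8}
user_scores = {"u1": scores, "u2": scores, "u3": scores}

_, plain_counts = rerank_with_popularity_penalty(user_scores, k=1, penalty=0.0)
_, spread_counts = rerank_with_popularity_penalty(user_scores, k=1, penalty=0.25)
```

Without a penalty every user receives item "a"; with the penalty the three recommendations are spread across all three items, which lowers the Gini coefficient of the recommendation counts.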

Compare interaction data, recommendations, and "corrected" recommendations in terms of concentration

Ultimately we get a model with lower concentration than the original data. The penalty term could be adjusted to control the amount of concentration.


Measure parity in terms of beer ABV (alcohol level). Do high- (or low-) alcohol items tend to get recommended more than we would expect from interaction data?

Start by training a BPR model (using the implicit library)

Frequency of positive (high alcohol) versus negative (low alcohol) among interactions

Frequency among recommendations (low alcohol items end up being slightly over-recommended)
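The comparison in the two steps above can be sketched as a single helper that reports the fraction of positive (high-alcohol) items in a collection of per-user lists. The item names, the labels and the function name are hypothetical; how the ABV threshold is chosen is not specified here.

```python
def positive_rate(per_user_items, is_positive):
    """Fraction of all listed items (with repetition) that are labeled positive."""
    total = 0
    positives = 0
    for items in per_user_items.values():
        for item in items:
            total += 1
            positives += 1 if is_positive[item] else 0
    return positives / total

# Toy data (hypothetical): True marks a high-alcohol item.
is_positive = {"stout": True, "ipa": True, "lager": False, "pils": False}
interactions = {"u1": ["stout", "lager"], "u2": ["ipa", "pils"]}
recommendations = {"u1": ["lager", "pils"], "u2": ["lager", "stout"]}

rate_interactions = positive_rate(interactions, is_positive)
rate_recommendations = positive_rate(recommendations, is_positive)
```

If the rate among recommendations is lower than among interactions, high-alcohol items are under-recommended relative to what the interaction data would suggest.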

As in 10.2, correct using a re-ranking strategy with a simple penalty term (in this case encouraging high-alcohol items to be recommended). Again this could be adjusted to achieve the desired calibration.
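A simple version of such a correction, sketched under the assumption that the penalty takes the form of a constant bonus added to the score of every high-alcohol item before ranking (the function name, the bonus value and the toy scores are illustrative):

```python
def rerank_with_group_bonus(scores, is_positive, k, bonus):
    """Top-k items after adding a fixed bonus to every positive item's score."""
    adjusted = {
        item: score + (bonus if is_positive[item] else 0.0)
        for item, score in scores.items()
    }
    return sorted(adjusted, key=adjusted.get, reverse=True)[:k]

# Toy example (hypothetical scores for one user).
scores = {"lager": 1.0, "pils": 0.9, "stout": 0.85, "ipa": 0.8}
is_positive = {"stout": True, "ipa": True, "lager": False, "pils": False}

uncorrected = rerank_with_group_bonus(scores, is_positive, k=2, bonus=0.0)
corrected = rerank_with_group_bonus(scores, is_positive, k=2, bonus=0.1)
```

Increasing the bonus shifts the share of high-alcohol items among recommendations upward, so it can be tuned until that share matches the share observed in the interaction data.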