Department of Computer Science and Engineering
University of California, San Diego
CSE 250B
Fall 2008
Assignment 4
DUE AT THE START OF CLASS ON THURSDAY DECEMBER 4, 2008
Important: The official deadline for this project is the last day of
classes, but you may hand it in any time up to and including
the start of the final exam, which is at 3pm on Thursday December 11.
The objective of this project is to understand Gibbs sampling as a
training method for latent Dirichlet allocation (LDA) models. LDA
is a generative probabilistic model for text documents that are
represented in the "bag-of-words" format. This means that each
document is viewed as a vector of counts, one count for each word
in the vocabulary. The number of dimensions d is the number of words
in the vocabulary. Usually d is much bigger than the number of
documents, and also bigger than the length of each document.
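As a small illustration of the bag-of-words format, the sketch below turns one tokenized document into a vector of counts. The function name bag_of_words, the four-word vocabulary, and the toy document are all made up for this example; a real vocabulary would have thousands of entries.

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Return the count vector of a tokenized document over a fixed vocabulary."""
    counts = Counter(tokens)
    # One count per vocabulary word, in vocabulary order; absent words get 0.
    return [counts[word] for word in vocabulary]

# Toy vocabulary (assumption for illustration), so d = 4.
vocabulary = ["gibbs", "model", "sampling", "topic"]
document = "topic model sampling topic".split()

vector = bag_of_words(document, vocabulary)
print(vector)  # [0, 1, 1, 2]
```

Word order in the document is discarded; only the counts survive.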
What to do: First, implement the Gibbs sampling training algorithm. Second, for an
LDA model trained on a collection of documents, write code to print the
most likely words for each topic. Third, write code to
create a visualization of the documents in a 2D or 3D space defined by
the topics of the trained model. Using Matlab is recommended,
since it provides graphics functions that you can take advantage of.
If you use Matlab, making the inner loop of your LDA code fast is
a challenge, but doable.
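The first two tasks above might be sketched as follows. This is a minimal sketch, not a reference solution: it assumes the collapsed Gibbs sampler commonly used for LDA (topic assignments are resampled one token at a time, with the multinomial parameters integrated out), and it is written in plain Python rather than Matlab so that it is self-contained. All names (gibbs_lda, top_words, n_dk, n_kw) and the toy corpus are hypothetical.

```python
import random

def gibbs_lda(docs, d, K, alpha, beta, num_sweeps, seed=0):
    """Collapsed Gibbs sampling for LDA (sketch).

    docs: list of documents, each a list of word ids in range(d).
    Returns (n_dk, n_kw): document-topic and topic-word count tables.
    """
    rng = random.Random(seed)
    n_dk = [[0] * K for _ in docs]      # tokens of document m assigned to topic k
    n_kw = [[0] * d for _ in range(K)]  # tokens of word w assigned to topic k
    n_k = [0] * K                       # total tokens assigned to topic k
    z = []                              # topic assignment of every token
    for m, doc in enumerate(docs):      # random initialization
        z_m = []
        for w in doc:
            k = rng.randrange(K)
            z_m.append(k)
            n_dk[m][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_m)
    for _ in range(num_sweeps):
        for m, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[m][i]             # remove the token's current assignment
                n_dk[m][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (n_dk[m][j] + alpha) * (n_kw[j][w] + beta) / (n_k[j] + d * beta)
                    for j in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[m][i] = k             # record the newly sampled assignment
                n_dk[m][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return n_dk, n_kw

def top_words(n_kw, vocabulary, n=3):
    """List the n highest-count words of each topic.

    Within one topic, ranking by count gives the same order as ranking by the
    smoothed probability (count + beta) / (topic total + d * beta).
    """
    result = []
    for row in n_kw:
        ranked = sorted(range(len(row)), key=lambda w: row[w], reverse=True)
        result.append([vocabulary[w] for w in ranked[:n]])
    return result

# Toy corpus (assumption for illustration): word ids index into vocabulary.
vocabulary = ["gene", "cell", "stock", "market"]
docs = [[0, 1, 0, 1], [2, 3, 3, 2], [0, 1, 2, 3]]
n_dk, n_kw = gibbs_lda(docs, d=4, K=2, alpha=0.5, beta=0.1, num_sweeps=50)
for k, words in enumerate(top_words(n_kw, vocabulary, n=2)):
    print("topic", k, words)
```

The triple loop over sweeps, documents, and tokens is the inner loop whose speed matters; the sampled topics depend on the random seed, so the printed words are not fixed.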
Do experiments with two datasets. The first is the Classic400 dataset, which is available here
in Matlab format. It consists of 400 documents over a vocabulary
of 6205 words. Show that the learned topics are meaningful for
this corpus, and that the visualization method gives meaningful
results. (To show that results are meaningful, compare with the
human-assigned group label of each document.)
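One possible way to set up the visualization and the comparison with group labels is sketched below, assuming a model with exactly three topics so that each document's topic mixture can be drawn as a point inside a triangle. The document-topic counts in counts, the labels, and all function names are hypothetical; purity is only one of several reasonable agreement measures.

```python
import math

def topic_proportions(topic_counts, alpha):
    """Smoothed estimate of one document's topic mixture from its topic counts."""
    total = sum(topic_counts) + alpha * len(topic_counts)
    return [(c + alpha) / total for c in topic_counts]

def simplex_to_2d(theta):
    """Map a 3-topic mixture to a point inside an equilateral triangle."""
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x = sum(t * cx for t, (cx, _) in zip(theta, corners))
    y = sum(t * cy for t, (_, cy) in zip(theta, corners))
    return x, y

def purity(thetas, labels):
    """Fraction of documents carrying the majority label of their dominant topic."""
    by_topic = {}
    for theta, label in zip(thetas, labels):
        k = max(range(len(theta)), key=lambda j: theta[j])
        by_topic.setdefault(k, []).append(label)
    agree = sum(max(ls.count(l) for l in set(ls)) for ls in by_topic.values())
    return agree / len(labels)

# Hypothetical per-document topic counts and human-assigned group labels.
counts = [[9, 1, 0], [8, 0, 2], [0, 10, 0], [1, 1, 8]]
labels = [1, 1, 2, 3]

thetas = [topic_proportions(row, alpha=0.1) for row in counts]
points = [simplex_to_2d(theta) for theta in thetas]
for point, label in zip(points, labels):
    print(label, point)
print("purity:", purity(thetas, labels))
```

Plotting the points with one color per label shows at a glance whether documents from the same group cluster near the same corner.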
The second dataset should be one that you select, obtain, and
preprocess yourself. This dataset may be an interesting collection of
documents. Or, for some extra credit, find a non-text dataset for
which the LDA model is appropriate, and explain why. In any case,
use your code to train LDA on the dataset, display a selection of
results concisely, and explain why these results are interesting.
In your report, try to answer the following questions. (The
questions are in no particular order, are related to each other, and do
not all have clear or easy answers.)
(1) Can you compute the log likelihood of the training data according to the trained model?
(2) How can you decide which of two LDA models is better for the same data?
(3) What is a good way of picking values for the hyperparameters alpha and beta?
(4) How can you measure whether an LDA model matches a human-specified organization of a set of documents well?
(5) How can you determine whether an LDA model is overfitting its training data?
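As one possible starting point for questions (1), (2), and (5), the sketch below computes the log likelihood and perplexity of a set of documents under point estimates of the model parameters. It is an approximation under stated assumptions, not a full answer: thetas (per-document topic mixtures) and phi (per-topic word distributions) are treated as fixed, whereas the true marginal likelihood under LDA integrates them out. The function names and toy values are hypothetical.

```python
import math

def log_likelihood(docs, thetas, phi):
    """Sum over all tokens of log sum_k thetas[m][k] * phi[k][w]."""
    total = 0.0
    for doc, theta in zip(docs, thetas):
        for w in doc:
            total += math.log(sum(t * row[w] for t, row in zip(theta, phi)))
    return total

def perplexity(docs, thetas, phi):
    """exp of the negative average per-token log likelihood; lower is better."""
    n_tokens = sum(len(doc) for doc in docs)
    return math.exp(-log_likelihood(docs, thetas, phi) / n_tokens)

# Toy check (assumption for illustration): two topics that are both uniform
# over a 4-word vocabulary, so every token has probability 1/4.
docs = [[0, 1], [2, 3, 3]]
thetas = [[0.5, 0.5], [0.5, 0.5]]
phi = [[0.25] * 4, [0.25] * 4]

ll = log_likelihood(docs, thetas, phi)
pp = perplexity(docs, thetas, phi)
print(ll, pp)
```

Comparing this quantity on the training documents with the same quantity on documents held out from training is one common way to look for overfitting; how to obtain thetas for held-out documents is itself a design decision.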