Department of Computer Science and Engineering
University of California, San Diego
CSE 250B
Fall 2008

Assignment 4

DUE AT THE START OF CLASS ON THURSDAY DECEMBER 4, 2008


Important: The official deadline for this project is the last day of classes, but you may hand it in any time up to and including the start of the final exam, which is at 3pm on Thursday December 11.


The objective of this project is to understand Gibbs sampling as a training method for latent Dirichlet allocation (LDA) models.  LDA is a generative probabilistic model for text documents that are represented in the "bag-of-words" format.  This means that each document is viewed as a vector of counts, one count for each word in the vocabulary, so the number of dimensions d equals the number of words in the vocabulary.  Usually d is much larger than the number of documents, and also larger than the length of any single document.
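To make the bag-of-words representation concrete, here is a minimal Python sketch that turns tokenized documents into count vectors.  The function name bag_of_words and the two toy documents are illustrative assumptions, not part of the assignment or of any dataset used in it.

```python
from collections import Counter

def bag_of_words(docs):
    """Turn tokenized documents into count vectors over a shared vocabulary."""
    # Vocabulary: every distinct word, sorted so that the indexing is reproducible.
    vocab = sorted({word for doc in docs for word in doc})
    index = {word: j for j, word in enumerate(vocab)}
    vectors = []
    for doc in docs:
        counts = [0] * len(vocab)          # one entry per vocabulary word
        for word, n in Counter(doc).items():
            counts[index[word]] = n
        vectors.append(counts)
    return vocab, vectors

# Toy corpus (hypothetical), already split into words.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
]
vocab, vectors = bag_of_words(docs)
print(len(vocab))    # the number of dimensions d
print(vectors[0])    # count vector of the first document
```

Word order is discarded: any reordering of the words in a document gives the same count vector.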

What to do:  First, implement the Gibbs sampling training algorithm.  Second, for an LDA model trained on a collection of documents, write code to print the most likely words for each topic.  Third, write code to create a visualization of the documents in a 2D or 3D space defined by the topics of the trained model.  Using Matlab is recommended, since it provides graphics functions that you can take advantage of.  If you use Matlab, making the inner loop of your LDA code fast is a challenge, but doable.
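As a rough starting point for the first two tasks, here is a sketch of collapsed Gibbs sampling in plain Python, followed by a helper that lists the highest-count words of each topic.  It assumes symmetric Dirichlet priors with scalar hyperparameters alpha and beta, documents given as lists of word ids, and a fixed number of sweeps.  The names gibbs_lda, top_words, n_dk, n_kw and the toy corpus are illustrative assumptions.  The code is written for readability rather than speed, and it is one possible implementation, not the required one.

```python
import random

def gibbs_lda(docs, V, K, alpha, beta, iters, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric priors (a sketch).

    docs is a list of documents, each a list of word ids in range(V).
    Returns (n_dk, n_kw), the document-topic and topic-word count tables.
    """
    rng = random.Random(seed)
    n_dk = [[0] * K for _ in docs]       # tokens of document d in topic k
    n_kw = [[0] * V for _ in range(K)]   # tokens of word w in topic k
    n_k = [0] * K                        # all tokens in topic k

    def update(d, w, k, delta):
        n_dk[d][k] += delta
        n_kw[k][w] += delta
        n_k[k] += delta

    # Start from a uniformly random topic assignment for every token.
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            update(d, w, k, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, z[d][i], -1)    # take the token out of the counts
                # Unnormalized p(z = k | all other assignments).
                weights = [
                    (n_dk[d][k] + alpha) * (n_kw[k][w] + beta) / (n_k[k] + V * beta)
                    for k in range(K)
                ]
                z[d][i] = rng.choices(range(K), weights=weights)[0]
                update(d, w, z[d][i], +1)    # put it back under the new topic
    return n_dk, n_kw

def top_words(n_kw, vocab, n=3):
    """For each topic, the n words with the largest counts."""
    result = []
    for counts in n_kw:
        order = sorted(range(len(vocab)), key=lambda w: counts[w], reverse=True)
        result.append([vocab[w] for w in order[:n]])
    return result

# Toy corpus (hypothetical): word ids index into vocab.
vocab = ["gene", "cell", "dna", "stock", "market", "price"]
docs = [[0, 1, 2, 0, 1], [2, 0, 1, 2], [3, 4, 5, 3], [5, 4, 3, 4, 5]]
n_dk, n_kw = gibbs_lda(docs, V=len(vocab), K=2, alpha=0.5, beta=0.1, iters=200)
for k, words in enumerate(top_words(n_kw, vocab)):
    print("topic", k, ":", " ".join(words))
```

Because beta is the same for every word in this sketch, ranking words by raw count within a topic gives the same order as ranking them by the smoothed probability (n_kw + beta) / (n_k + V * beta).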

Do experiments with two datasets.  The first is the Classic400 dataset, which is available here in Matlab format.  It consists of 400 documents over a vocabulary of 6205 words.  Show that the learned topics are meaningful for this corpus, and that the visualization method gives meaningful results. (To show that results are meaningful, compare with the human-assigned group label of each document.)
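One possible way to place documents in a 2D space defined by the topics is sketched below.  It assumes a model with exactly K = 3 topics, so that each document's topic proportions lie on a triangle, and it assumes a table n_dk of per-document topic counts like the one a Gibbs sampler maintains.  The names topic_proportions, simplex_to_2d and CORNERS, and the example counts, are illustrative assumptions.  Other choices of K need a different projection.

```python
import math

def topic_proportions(n_dk, alpha):
    """Smoothed estimate theta[d][k] = (n_dk + alpha) / (n_d + K * alpha)."""
    thetas = []
    for row in n_dk:
        total = sum(row) + len(row) * alpha
        thetas.append([(count + alpha) / total for count in row])
    return thetas

# Corners of an equilateral triangle, one corner per topic (K = 3 assumed).
CORNERS = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]

def simplex_to_2d(theta):
    """Map topic proportions (three numbers summing to 1) to a point in the triangle."""
    x = sum(t * cx for t, (cx, _) in zip(theta, CORNERS))
    y = sum(t * cy for t, (_, cy) in zip(theta, CORNERS))
    return x, y

# Hypothetical topic counts for three documents.
n_dk = [[10, 0, 0], [0, 5, 5], [4, 4, 4]]
thetas = topic_proportions(n_dk, alpha=0.1)
points = [simplex_to_2d(theta) for theta in thetas]
for point in points:
    print(point)
```

A document dominated by one topic lands near that topic's corner, and a document that mixes all three topics evenly lands at the center of the triangle.  Coloring each point by the human-assigned group label of its document is one way to make the comparison with the labels visible.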

The second dataset should be one that you select, obtain, and preprocess yourself.  It may be any interesting collection of documents.  Or, for some extra credit, find a non-text dataset for which the LDA model is appropriate, and explain why.  In either case, use your code to train LDA on the dataset, display a selection of results concisely, and explain why these results are interesting.

In your report, try to answer the following questions.  (The questions are in no particular order, are related to each other, and do not all have clear or easy answers.)
(1) Can you compute the log likelihood of the training data according to the trained model?
(2) How can you decide which of two LDA models is better for the same data?
(3) What is a good way of picking values for the hyperparameters alpha and beta?
(4) How can you measure how well an LDA model matches a human-specified organization of a set of documents?
(5) How can you determine whether an LDA model is overfitting its training data?
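
As one possible starting point for thinking about questions (1), (2), and (5), the sketch below computes a plug-in log likelihood and the corresponding per-word perplexity from point estimates of the model parameters.  It assumes theta[d][k] is an estimate of the proportion of topic k in document d and phi[k][w] is an estimate of the probability of word w under topic k.  These names, the function names, and the toy numbers are illustrative assumptions.  Note that this quantity treats theta and phi as fixed, so it is not the same as the marginal likelihood of the data under the LDA model, which integrates over them.  Whether it is an adequate answer is part of what the report should discuss.

```python
import math

def plugin_log_likelihood(docs, theta, phi):
    """Sum over tokens of log sum_k theta[d][k] * phi[k][w], with theta and phi held fixed."""
    total = 0.0
    for d, doc in enumerate(docs):
        for w in doc:
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            total += math.log(p)
    return total

def perplexity(docs, theta, phi):
    """exp of the negative average log likelihood per token (lower is better)."""
    n_tokens = sum(len(doc) for doc in docs)
    return math.exp(-plugin_log_likelihood(docs, theta, phi) / n_tokens)

# Toy check (hypothetical numbers): two topics, both uniform over four words.
docs = [[0, 1], [2, 3, 3]]
theta = [[0.5, 0.5], [0.5, 0.5]]
phi = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
print(plugin_log_likelihood(docs, theta, phi))
print(perplexity(docs, theta, phi))
```

In the toy check every word has probability 1/4 under the model, so the perplexity equals the vocabulary size, 4.  Evaluating the same quantity on documents that were held out from training is one common way to compare models.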