Word burstiness, topic models, and perplexity

If a term is used once in a document, then it is likely to be used again. This phenomenon is called burstiness, and it implies that the second and later appearances of a word are less significant than the first appearance. However, most topic models do not model burstiness. Because the DCMLDA model does account for burstiness, it tends to achieve better goodness of fit with fewer topics than standard latent Dirichlet allocation (LDA).
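To make the burstiness argument concrete, here is a small Python sketch (not the authors' code; the vocabulary size, word counts, and alpha values are invented for illustration) comparing a plain multinomial with the Dirichlet compound multinomial (DCM) that DCMLDA is built on. The multinomial assigns the same probability to a bursty word sequence and to an evenly spread one of the same length, whereas the DCM prefers the bursty sequence, so later occurrences of a word cost less than the first.

    # Compare a multinomial and a Dirichlet compound multinomial (DCM / Polya)
    # on two word sequences of equal length. All numbers are illustrative.
    from math import lgamma, log

    def log_multinomial(counts, theta):
        # Log-probability of a specific word sequence with these per-word
        # counts under a multinomial with parameters theta.
        return sum(n * log(t) for n, t in zip(counts, theta))

    def log_dcm(counts, alpha):
        # Log-probability of the same word sequence under a DCM with
        # Dirichlet parameters alpha.
        A, N = sum(alpha), sum(counts)
        return (lgamma(A) - lgamma(A + N)
                + sum(lgamma(a + n) - lgamma(a) for a, n in zip(alpha, counts)))

    bursty = [4, 0, 0, 0]   # one word repeated four times
    spread = [1, 1, 1, 1]   # four distinct words

    alpha = [0.5, 0.5, 0.5, 0.5]             # small alpha -> strong burstiness
    theta = [a / sum(alpha) for a in alpha]  # multinomial with the same mean

    print(log_multinomial(bursty, theta), log_multinomial(spread, theta))  # equal
    print(log_dcm(bursty, alpha), log_dcm(spread, alpha))  # bursty is more likely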

This advantage of DCMLDA has recently been confirmed independently in the paper Probabilistic topic models for sequence data by Nicola Barbieri, Giuseppe Manco, Ettore Ritacco, Marco Carnuccio, and Antonio Bevacqua, published in the journal Machine Learning and available at http://link.springer.com/article/10.1007%2Fs10994-013-5391-2. Here is Figure 2a from the paper, reproduced with permission from the authors:

[Figure 2a: perplexity as a function of the number of topics, for various topic models including DCMLDA]
In the figure, perplexity is a measure of goodness of fit computed on held-out test data; lower perplexity is better. Compared with the four other topic models, DCMLDA (blue line) achieves the lowest perplexity. It is also the only method that suggests a reasonable optimal number of topics: for this text collection, 40 topics fit better than either 20 or 60. The other topic models keep fitting better as the number of topics grows, so they offer no natural number of topics to choose.
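For readers who have not seen the measure, here is a minimal Python sketch of how held-out perplexity is usually computed. It assumes a trained model has already assigned a probability to every token in the held-out text; the probabilities below are placeholders, not the output of any of the models in the figure.

    # Perplexity is the exponential of the average negative log-likelihood
    # per held-out token; a better-fitting model gives higher probabilities
    # to the held-out tokens and therefore lower perplexity.
    from math import exp, log

    def perplexity(token_probs):
        n = len(token_probs)
        return exp(-sum(log(p) for p in token_probs) / n)

    print(perplexity([0.02, 0.05, 0.01, 0.04]))  # weaker fit -> higher perplexity
    print(perplexity([0.10, 0.20, 0.08, 0.15]))  # better fit -> lower perplexity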

The original paper on DCMLDA is Accounting for Word Burstiness in Topic Models by Gabriel Doyle and Charles Elkan, published at ICML 2009. Source code for DCMLDA, written in Matlab, is available in the file dcmldacode.zip. The file is approximately 8 megabytes because it also contains several datasets used in the experiments.