If a term is used once in a document,
then it is likely to be used again. This phenomenon is called
burstiness, and it implies that the second and later appearances
of a word are less significant than the first appearance.
However, most topic models do not model burstiness. Because the
DCMLDA model does account for burstiness, it tends to achieve
better goodness of fit with fewer topics than standard latent
Dirichlet allocation (LDA).
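To see why burstiness matters, here is a minimal sketch (not the DCMLDA implementation itself) contrasting a plain multinomial with the Dirichlet compound multinomial (DCM) that DCMLDA builds on. All parameter values below are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Under a DCM, each document first draws its own word distribution from
# a Dirichlet; with small concentration parameters, a word that appears
# once in a document is likely to appear again, i.e. words are bursty.
vocab_size = 20
words_per_doc = 50
num_docs = 2000
alpha = np.full(vocab_size, 0.1)   # small alpha -> bursty documents
mean_p = alpha / alpha.sum()       # expected word frequencies

# Multinomial baseline: every document shares one word distribution.
multinomial_counts = rng.multinomial(words_per_doc, mean_p, size=num_docs)

# DCM: each document draws its own distribution, then its word counts.
thetas = rng.dirichlet(alpha, size=num_docs)
dcm_counts = np.array([rng.multinomial(words_per_doc, t) for t in thetas])

# Both models give the same expected counts, but the DCM counts have a
# much larger variance: words repeat within documents far more often.
print(multinomial_counts.var(), dcm_counts.var())
```

With the same mean word frequencies, the DCM's per-word count variance is many times that of the multinomial, which is exactly the repeated-appearance behavior described above.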
This advantage of DCMLDA has recently been confirmed independently
in the paper
Probabilistic topic models for
sequence data by Nicola Barbieri, Giuseppe Manco, Ettore
Ritacco, Marco Carnuccio, and Antonio Bevacqua, published in the
journal
Machine Learning and available at
http://link.springer.com/article/10.1007%2Fs10994-013-5391-2.
Here is Figure 2a from this paper, reproduced with permission
from the authors:
In the figure, perplexity is a measure of goodness of fit based on
held-out test data. Lower perplexity is better. Compared to four
other topic models, DCMLDA (blue line) achieves the lowest
perplexity. Also, it is the only method that suggests a reasonable
optimal number of topics. For this text collection, 40 topics
provide a better fit than 20 or 60 topics. The other topic models
tend to fit better as the number of topics increases, so they
suggest no natural number of topics to choose.
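Perplexity itself is simple to compute: it is the exponential of the negative average per-token log-likelihood on held-out text. The sketch below shows the formula on hypothetical token probabilities standing in for those a trained topic model would assign:

```python
import math

def perplexity(token_log_probs, num_tokens):
    """exp of the negative average per-token held-out log-likelihood.

    Lower perplexity means the model assigns higher probability to the
    held-out tokens, i.e. a better fit.
    """
    return math.exp(-sum(token_log_probs) / num_tokens)

# Hypothetical per-token probabilities from some trained model.
held_out = [math.log(0.01), math.log(0.02), math.log(0.005), math.log(0.01)]
print(round(perplexity(held_out, len(held_out)), 1))  # -> 100.0
```

Perplexity is the inverse of the geometric mean of the token probabilities, so it can be read as an effective vocabulary size: a perplexity of 100 means the model is, on average, as uncertain about each held-out token as a uniform choice among 100 words.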
The DCMLDA model was introduced in the paper Accounting for
burstiness in topic models by G. Doyle and C. Elkan, published at
ICML in 2009. The file is approximately 8 megabytes because it
contains several datasets used for experiments.