July 27-31, 2011 | David Mimno, Hanna M. Wallach, Edmund Talley, Miriam Leenders, Andrew McCallum
This paper introduces a new statistical topic model that significantly improves topic quality in large-scale document collections by directly optimizing a metric of topic coherence. The authors analyze the ways in which topics can be flawed and propose an automated evaluation metric for identifying such topics without relying on human annotators or external reference corpora. They also introduce a novel topic model based on this metric that enhances topic coherence.
The paper discusses the limitations of traditional topic models like latent Dirichlet allocation (LDA), which often produce low-quality topics that are not interpretable to domain experts. The authors propose a new coherence metric that measures the co-occurrence of a topic's words and show that it correlates well with human judgments of topic quality. The metric is validated against an expert annotation study of 148 topics learned from National Institutes of Health (NIH) grant abstracts, in which 90 topics were labeled "good," 21 "intermediate," and 37 "bad."
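The summary does not reproduce the metric's formula, but the paper defines coherence in terms of the document frequency and co-document frequency of a topic's top words. The following is a minimal sketch of that style of computation, assuming a co-document-frequency formulation with +1 smoothing inside the logarithm; the function and variable names are illustrative, not the paper's code.

```python
from math import log

def topic_coherence(top_words, documents):
    """Score a single topic by how often its top words co-occur.

    A sketch of a co-document-frequency coherence score: pairs of
    high-ranking topic words that never appear together in any
    document drag the score down (more negative = less coherent).

    top_words  -- the topic's M most probable words, highest first
    documents  -- iterable of token lists, one per document
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(w):
        # number of documents containing word w
        return sum(1 for d in doc_sets if w in d)

    def co_doc_freq(w1, w2):
        # number of documents containing both w1 and w2
        return sum(1 for d in doc_sets if w1 in d and w2 in d)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            # +1 keeps the log argument positive when a pair never co-occurs
            score += log((co_doc_freq(top_words[m], top_words[l]) + 1.0)
                         / doc_freq(top_words[l]))
    return score
```

Ranking topics by such a score and flagging the lowest-scoring ones reproduces the kind of automated quality screening the paper describes, without human annotators or an external reference corpus.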
The authors also introduce a generalized Pólya urn model that incorporates word co-occurrence information directly into the statistical topic modeling framework. This model outperforms LDA in terms of both topic coherence and held-out probability. Evaluated on a corpus of NIH grant abstracts, it shows significant improvements in topic coherence, particularly for the ten lowest-scoring topics.
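The summary does not spell out how co-occurrence information enters the sampler, but the idea of a generalized Pólya urn is that assigning a word to a topic also adds fractional counts for related words. The sketch below illustrates that count update in isolation; the schema matrix `A` and the toy weights are assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

def urn_update(topic_word_counts, topic, word, A):
    """Generalized Polya urn count update (illustrative sketch).

    In a standard LDA Gibbs sampler, assigning `word` to `topic` adds 1 to
    topic_word_counts[topic, word]. Under a generalized Polya urn scheme,
    the assignment also adds weight to words that frequently co-occur with
    `word`, as encoded in a schema matrix A, so coherent word clusters
    reinforce one another within a topic.

    A[v, w] -- extra mass added to word v when word w is drawn
               (A[w, w] = 1 recovers the ordinary LDA update).
    """
    topic_word_counts[topic, :] += A[:, word]


# Toy example: a 3-word vocabulary in which words 0 and 1 co-occur often.
V = 3
A = np.eye(V)
A[0, 1] = A[1, 0] = 0.3          # assumed co-occurrence weight, for illustration
counts = np.zeros((2, V))        # 2 topics
urn_update(counts, topic=0, word=1, A=A)
print(counts[0])                 # [0.3, 1.0, 0.0]
```

Because the update couples related word types, the schema weights would be precomputed from corpus co-occurrence statistics in practice; the matrix here is hand-set purely to show the mechanics.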
The paper concludes that the new model achieves better performance with substantially fewer Gibbs iterations than LDA and that it is possible to construct unsupervised topic models that do not produce bad topics. The authors also suggest that such methods may be useful in guiding online stochastic variational inference.