July 27-31, 2011 | David Mimno, Hanna M. Wallach, Edmund Talley, Miriam Leenders, Andrew McCallum
This paper introduces a new statistical topic model that significantly improves topic quality in large-scale document collections by directly optimizing a metric of topic coherence. The authors analyze the ways in which topics can be flawed and propose an automated evaluation metric for identifying such topics without relying on human annotators or external reference corpora. They also introduce a novel topic model based on this metric that enhances topic coherence.
The paper discusses the limitations of traditional topic models like latent Dirichlet allocation (LDA), which often produce low-quality topics that are not interpretable to domain experts. The authors propose a new coherence metric that measures the co-occurrence of a topic's words and show that it correlates well with human judgments of topic quality. The metric is validated against an expert annotation study of 148 topics learned from National Institutes of Health (NIH) grant abstracts, in which 90 topics were labeled "good," 21 "intermediate," and 37 "bad."
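The summary does not reproduce the metric's formula, but the paper defines coherence in terms of the document frequency and co-document frequency of a topic's top words. The following is a minimal sketch of that style of computation, assuming a co-document-frequency formulation with +1 smoothing inside the logarithm; the function and variable names are illustrative, not the paper's code.

```python
from math import log

def topic_coherence(top_words, documents):
    """Score a single topic by how often its top words co-occur.

    A sketch of a co-document-frequency coherence score: pairs of
    high-ranking topic words that never appear together in any
    document drag the score down (more negative = less coherent).

    top_words  -- the topic's M most probable words, highest first
    documents  -- iterable of token lists, one per document
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(w):
        # number of documents containing word w
        return sum(1 for d in doc_sets if w in d)

    def co_doc_freq(w1, w2):
        # number of documents containing both w1 and w2
        return sum(1 for d in doc_sets if w1 in d and w2 in d)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            # +1 keeps the log argument positive when a pair never co-occurs
            score += log((co_doc_freq(top_words[m], top_words[l]) + 1.0)
                         / doc_freq(top_words[l]))
    return score
```

Ranking topics by such a score and flagging the lowest-scoring ones reproduces the kind of automated quality screening the paper describes, without human annotators or an external reference corpus.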
The authors also introduce a generalized Pólya urn model that incorporates word co-occurrence information directly into the statistical topic modeling framework. This model outperforms LDA in terms of both topic coherence and held-out probability. Evaluated on a corpus of NIH grant abstracts, it shows significant improvements in topic coherence, particularly for the ten lowest-scoring topics.
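The summary does not spell out how co-occurrence information enters the sampler, but the idea of a generalized Pólya urn is that assigning a word to a topic also adds fractional counts for related words. The sketch below illustrates that count update in isolation; the schema matrix `A` and the toy weights are assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

def urn_update(topic_word_counts, topic, word, A):
    """Generalized Polya urn count update (illustrative sketch).

    In a standard LDA Gibbs sampler, assigning `word` to `topic` adds 1 to
    topic_word_counts[topic, word]. Under a generalized Polya urn scheme,
    the assignment also adds weight to words that frequently co-occur with
    `word`, as encoded in a schema matrix A, so coherent word clusters
    reinforce one another within a topic.

    A[v, w] -- extra mass added to word v when word w is drawn
               (A[w, w] = 1 recovers the ordinary LDA update).
    """
    topic_word_counts[topic, :] += A[:, word]


# Toy example: a 3-word vocabulary in which words 0 and 1 co-occur often.
V = 3
A = np.eye(V)
A[0, 1] = A[1, 0] = 0.3          # assumed co-occurrence weight, for illustration
counts = np.zeros((2, V))        # 2 topics
urn_update(counts, topic=0, word=1, A=A)
print(counts[0])                 # [0.3, 1.0, 0.0]
```

Because the update couples related word types, the schema weights would be precomputed from corpus co-occurrence statistics in practice; the matrix here is hand-set purely to show the mechanics.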
The paper concludes that the new model achieves better performance with substantially fewer Gibbs iterations than LDA and that it is possible to construct unsupervised topic models that do not produce bad topics. The authors also suggest that such methods may be useful in guiding online stochastic variational inference.