June 2010 | David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin
This paper introduces the task of topic coherence evaluation, in which a set of words generated by a topic model is rated for its coherence or interpretability. Various topic scoring models are applied, drawing on resources such as WordNet, Wikipedia, and the Google search engine, as well as existing research on lexical similarity and relatedness. Compared against human scores for topics learned over two datasets, a simple co-occurrence measure based on pointwise mutual information (PMI) over Wikipedia data achieves results at or near the level of inter-annotator correlation; other Wikipedia-based methods also perform strongly. Google produces strong but less consistent results, while WordNet-based methods are patchy.
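Concretely, for a topic t with top-n word set W_t, the PMI-based score aggregates over all word pairs. As a sketch of the standard formulation (the paper estimates the probabilities from co-occurrence counts within a sliding window over Wikipedia; the exact window size and smoothing are details of its setup):

```latex
\mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)},
\qquad
\mathrm{score}(t) = \operatorname*{mean}_{i < j} \; \mathrm{PMI}(w_i, w_j),
\quad w_i, w_j \in W_t
```

The median over the same set of pairwise scores gives the alternative aggregation discussed in the conclusions.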
The paper explores intrinsic evaluation of topics, a question typically overlooked in computational linguistics. Topics learned from news articles and books are evaluated for how coherent they appear to humans, and models are proposed to predict those coherence judgments using WordNet, Wikipedia, and Google. The annotators show remarkable agreement on what constitutes a coherent topic, and the best Wikipedia-based methods come close to that inter-annotator agreement.
The research is part of a larger agenda on the utility of topic modeling in document collection visualization and search interfaces. Evaluating topic coherence is a step toward understanding what makes a good topic and how topic models can be made useful for human consumption.
The paper compares several families of methods for evaluating topic coherence: WordNet-based similarity measures, Wikipedia-based methods, and Google-based methods. The best performer is term co-occurrence scored by pointwise mutual information over Wikipedia, which achieves results close to inter-annotator agreement. Google performs well on one dataset but poorly on the other, and WordNet-based methods lag overall.
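To make the scoring procedure concrete, here is a minimal Python sketch of PMI-based topic scoring. The count lookups (pair_counts, word_counts, total_windows) are hypothetical stand-ins for precomputed sliding-window co-occurrence statistics over Wikipedia, and the zero-handling convention is an assumption rather than the paper's exact smoothing:

```python
import math
from itertools import combinations
from statistics import mean, median

def pmi(w1, w2, pair_counts, word_counts, total_windows):
    # Probabilities estimated from sliding-window co-occurrence counts.
    # pair_counts is assumed to be keyed on alphabetically ordered pairs.
    p_joint = pair_counts.get((w1, w2), 0) / total_windows
    p1 = word_counts.get(w1, 0) / total_windows
    p2 = word_counts.get(w2, 0) / total_windows
    if p_joint == 0 or p1 == 0 or p2 == 0:
        return 0.0  # one possible convention for unseen words/pairs
    return math.log(p_joint / (p1 * p2))

def topic_coherence(top_words, pair_counts, word_counts, total_windows, agg=mean):
    # Score a topic by aggregating PMI over all pairs of its top-n words;
    # sorting ensures pairs come out in the order pair_counts is keyed on.
    pairs = combinations(sorted(set(top_words)), 2)
    return agg([pmi(w1, w2, pair_counts, word_counts, total_windows)
                for w1, w2 in pairs])
```

A topic such as ["space", "earth", "moon", "nasa"] would be scored by averaging PMI over its six word pairs; passing agg=median gives the median variant the paper also considers.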
The paper concludes that automatic topic coherence evaluation is computationally feasible and that word-pair co-occurrence is an effective basis for modeling coherence. Wikipedia is the most consistent resource for topic scoring, and PMI the most effective scoring method over it; Wikipedia's encyclopedic nature gives good coverage of both domains, making it more robust than narrower resources. WordNet-based methods show mixed results, with some measures performing poorly. The paper also notes that there is no clear answer as to whether the mean or the median is the better way to combine pairwise scores into a topic score.
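As a quick illustration of why the mean-versus-median choice matters, the mean is pulled down by a single anomalous word pair while the median is not (the values below are made up for illustration):

```python
from statistics import mean, median

pairwise_pmi = [2.1, 1.8, 2.4, 1.9, -5.0]  # one outlier pair drags the mean
print(mean(pairwise_pmi))    # 0.64: sensitive to the outlier
print(median(pairwise_pmi))  # 1.9: robust to it
```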