April 6, 2004 | Thomas L. Griffiths and Mark Steyvers
The article by Thomas L. Griffiths and Mark Steyvers introduces a statistical method for identifying the topics addressed in scientific documents using Latent Dirichlet Allocation (LDA), a generative model for documents. The authors describe the LDA model, which treats each document as a mixture of topics, and present a Markov chain Monte Carlo (MCMC) algorithm for inference in this model. They apply the method to abstracts from the Proceedings of the National Academy of Sciences (PNAS) from 1991 to 2001, determining the number of topics needed to account for the data and extracting a set of topics. The extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles.
The authors also use the topics to illustrate relationships between scientific disciplines, assess trends and "hot topics" by analyzing topic dynamics, and highlight the semantic content of documents by tagging words with their assigned topics. The results show that the algorithm can recover meaningful aspects of the structure of science and that it has several applications, including exploring topic dynamics and indicating the role of words in the semantic content of documents.
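To make the inference step concrete, here is a minimal sketch of one common MCMC scheme for LDA, a collapsed Gibbs sampler that repeatedly resamples the topic assigned to each word token. It is an illustration under stated assumptions, not the authors' implementation: the function name `gibbs_lda`, the symmetric Dirichlet hyperparameters `alpha` and `beta`, their default values, and the toy corpus are all hypothetical choices made for this example.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs is a list of documents, each a list of integer word ids in
    range(vocab_size). alpha and beta are assumed symmetric Dirichlet
    hyperparameters; the defaults are placeholders, not tuned values.
    """
    rng = random.Random(seed)
    n_wt = [[0] * n_topics for _ in range(vocab_size)]  # word-topic counts
    n_dt = [[0] * n_topics for _ in docs]               # document-topic counts
    n_t = [0] * n_topics                                # tokens per topic

    # Start from a random topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(n_topics)
            z_d.append(t)
            n_wt[w][t] += 1
            n_dt[d][t] += 1
            n_t[t] += 1
        z.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from all counts.
                t = z[d][i]
                n_wt[w][t] -= 1
                n_dt[d][t] -= 1
                n_t[t] -= 1

                # Unnormalized conditional for each topic j:
                # (word-topic term) * (document-topic term).
                weights = [
                    (n_wt[w][j] + beta) / (n_t[j] + vocab_size * beta)
                    * (n_dt[d][j] + alpha)
                    for j in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]

                # Record the new assignment and restore the counts.
                z[d][i] = t
                n_wt[w][t] += 1
                n_dt[d][t] += 1
                n_t[t] += 1

    return z, n_wt, n_dt


if __name__ == "__main__":
    # Hypothetical toy corpus: word ids 0-2 and 3-5 tend to co-occur.
    toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 3, 4, 5]]
    z, n_wt, n_dt = gibbs_lda(toy_docs, n_topics=2, vocab_size=6)
    for d, counts in enumerate(n_dt):
        print(f"document {d}: topic counts {counts}")
```

The per-token assignments in `z` are the kind of output that supports tagging words with their topics, and the document-topic counts in `n_dt` give a rough estimate of each document's topic mixture. Choosing the number of topics and analyzing topic dynamics over time are separate steps that this sketch does not cover.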