Supervised Topic Models


3 Mar 2010 | David M. Blei, Jon D. McAuliffe
This paper introduces supervised latent Dirichlet allocation (sLDA), a statistical model for labeled documents. sLDA extends latent Dirichlet allocation (LDA) by pairing each document with a response variable, allowing the model to predict the response from the document's content. The response may be continuous, count-valued, or categorical. The paper derives an approximate maximum-likelihood procedure for parameter estimation, using variational methods to handle intractable posterior expectations.

Under the sLDA generative process, each document is a collection of words drawn from a mixture of topics, and its response variable is drawn from a generalized linear model (GLM) whose covariates derive from the document's topic assignments. The GLM framework accommodates the different response types, and the topics and response coefficients are estimated jointly. Variational inference is used for posterior inference and parameter estimation, and the fitted model predicts responses for new documents from their inferred topic structure. The paper also discusses the computational challenges of sLDA (posterior inference, parameter estimation, and prediction), gives specific algorithms for Gaussian and Poisson response distributions, and outlines a general approach for other exponential-family responses.

The model is applied to two real-world problems: predicting movie ratings from reviews and predicting the political tone of U.S. Senate amendments from their text. On both tasks, sLDA outperforms unsupervised LDA followed by regression as well as modern regularized regression techniques such as the lasso, demonstrating greater predictive power in settings where the response is driven by the document's content.
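The generative process summarized above can be sketched as follows for the Gaussian response case: topic proportions are drawn from a Dirichlet, each word gets a topic assignment, and the response is a noisy linear function of the document's empirical topic frequencies. This is a minimal illustrative sketch, not the paper's implementation; all variable names and the toy dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(alpha, beta, eta, sigma2, n_words, rng):
    """Sketch of sLDA's generative process for one document
    (Gaussian response). alpha: Dirichlet prior (K,), beta: topic-word
    distributions (K, V), eta: GLM coefficients (K,), sigma2: noise variance."""
    K, V = beta.shape
    theta = rng.dirichlet(alpha)                    # per-document topic proportions
    z = rng.choice(K, size=n_words, p=theta)        # topic assignment for each word
    words = np.array([rng.choice(V, p=beta[k]) for k in z])  # words from assigned topics
    z_bar = np.bincount(z, minlength=K) / n_words   # empirical topic frequencies
    y = rng.normal(eta @ z_bar, np.sqrt(sigma2))    # response: linear model in z_bar
    return words, y

# Toy run with 4 topics and a 20-word vocabulary (illustrative values).
K, V = 4, 20
alpha = np.ones(K)
beta = rng.dirichlet(np.ones(V), size=K)
eta = np.array([2.0, -1.0, 0.5, 0.0])
words, y = generate_document(alpha, beta, eta, sigma2=0.25, n_words=50, rng=rng)
```

The key modeling choice visible here is that the response depends on the realized topic assignments (`z_bar`), not on the latent proportions `theta`, which ties the response directly to the words that actually appear.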
The model is flexible and can be applied to various types of response variables, making it a valuable tool for text analysis and prediction tasks.
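The flexibility across response types comes from the GLM's link function: the linear score over topic frequencies is mapped to the response mean differently for continuous and count data. A minimal sketch of this idea, with a hypothetical helper name (the paper works with the full exponential-family form):

```python
import numpy as np

def glm_mean(eta, z_bar, response="gaussian"):
    """Mean response under a GLM with empirical topic frequencies z_bar
    as covariates. Hypothetical helper for illustration only."""
    lin = eta @ z_bar                 # linear score in topic space
    if response == "gaussian":        # identity link: continuous responses
        return lin
    if response == "poisson":         # log link: count responses
        return np.exp(lin)
    raise ValueError(f"unsupported response type: {response}")
```

For example, a document split evenly between two topics with coefficients 1.0 and 0.0 has a Gaussian mean of 0.5 and a Poisson mean of exp(0.5); other exponential-family responses would slot in the same way with their own links.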