July 25, 2010 | Liangjie Hong and Brian D. Davison
This paper presents an empirical study of topic modeling on Twitter, focusing on how to effectively train standard topic models in short-text settings. The study investigates different training schemes for topic models on Twitter data, comparing their quality and effectiveness through both qualitative and quantitative experiments. The authors compare three schemes: MSG (training on individual messages), USER (training on messages aggregated by user), and TERM (training on messages aggregated by the terms they contain). They also explore the Author-Topic (AT) model, which extends LDA to account for both authors and topics.
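As a rough illustration of how the three schemes reshape the training corpus, here is a minimal Python sketch; the toy tweets, tokenizer, and variable names are assumptions for illustration, not the paper's preprocessing.

```python
from collections import defaultdict

# Toy corpus: (user, message) pairs standing in for Twitter data.
tweets = [
    ("alice", "topic models for short text"),
    ("alice", "training lda on twitter"),
    ("bob",   "short text is hard for lda"),
]

def tokenize(text):
    # Whitespace tokenizer; the paper's actual preprocessing is more involved.
    return text.lower().split()

# MSG: each message is its own training document.
msg_docs = [tokenize(text) for _, text in tweets]

# USER: concatenate all of a user's messages into one document.
user_profiles = defaultdict(list)
for user, text in tweets:
    user_profiles[user].extend(tokenize(text))
user_docs = list(user_profiles.values())

# TERM: for each term, concatenate every message containing that term.
term_profiles = defaultdict(list)
for _, text in tweets:
    tokens = tokenize(text)
    for term in set(tokens):
        term_profiles[term].extend(tokens)
term_docs = list(term_profiles.values())
```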
The study shows that training a topic model on aggregated user messages leads to a higher-quality model and better performance on two real-world classification tasks. The results indicate that topics learned under the different aggregation strategies are substantially different, and that aggregating messages before training yields both faster training and better quality. Topic mixture distributions inferred by the models can serve as useful features in classification tasks, significantly improving overall performance.
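As a sketch of how topic mixtures can feed a downstream classifier, the snippet below uses scikit-learn's variational LDA rather than the Gibbs-sampled LDA used in the paper, and the tiny corpus and labels are invented for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: aggregated user profiles and binary labels
# (the paper's tasks are message popularity and user classification).
profiles = [
    "topic models for short text training lda on twitter",
    "short text is hard for lda and topic models",
    "football scores and match highlights tonight",
    "transfer news and match previews this weekend",
]
labels = [0, 0, 1, 1]

# Bag-of-words counts, then an LDA fit to obtain per-document topic mixtures.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(profiles)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)  # each row is a topic mixture distribution

# Use the topic mixtures as features for a downstream classifier.
clf = LogisticRegression().fit(theta, labels)
print(clf.predict(theta))
```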
The authors also discuss the limitations of the Author-Topic model in capturing hierarchical relationships between entities in social media. They find that the AT model does not perform as well as the USER scheme at modeling topics in messages, and that this simple extension of LDA yields no better results for messages or users than training standard LDA on user-aggregated profiles. The study concludes that topic models can be very useful for short text, either as standalone features or as complements to other features, across multiple real-world tasks. However, when content information is already sufficient (e.g., in user classification), topic-based features add little over simple TF-IDF scores. The authors suggest that future models should consider how to model the hierarchical structure between users and messages.
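For context, here is a minimal sketch of the AT model's generative process (following Rosen-Zvi et al.), with illustrative hyperparameters; note that on Twitter each message has a single author, so the model's uniform choice among a document's authors is degenerate, which hints at why the extension adds little here.

```python
import numpy as np

rng = np.random.default_rng(0)
num_topics, vocab_size = 5, 100
alpha, beta = 0.1, 0.01  # illustrative Dirichlet hyperparameters

# Per-author topic mixtures (theta) and per-topic word distributions (phi).
authors = ["alice", "bob"]
theta = rng.dirichlet(np.full(num_topics, alpha), size=len(authors))
phi = rng.dirichlet(np.full(vocab_size, beta), size=num_topics)

def generate_message(author_ids, length):
    """Generate one message: each word picks an author uniformly,
    then a topic from that author's mixture, then a word from the topic."""
    words = []
    for _ in range(length):
        x = rng.choice(author_ids)               # sample an author
        z = rng.choice(num_topics, p=theta[x])   # topic from author's mixture
        w = rng.choice(vocab_size, p=phi[z])     # word from the topic
        words.append(w)
    return words

# A single-author tweet: the message simply inherits its user's topic mixture.
print(generate_message([0], length=8))
```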