Baselines and Bigrams: Simple, Good Sentiment and Topic Classification

8-14 July 2012 | Sida Wang and Christopher D. Manning
This paper investigates the performance of Naive Bayes (NB) and Support Vector Machines (SVMs) on text classification tasks, with a focus on sentiment analysis. Including word bigram features consistently improves sentiment analysis performance. On short snippet sentiment tasks, NB outperforms SVMs, while on full-length documents SVMs perform better. A simple SVM variant that uses NB log-count ratios as feature values (NBSVM) performs well across tasks and datasets, and these simple NB and SVM variants outperform most published results on sentiment analysis datasets.

The paper distinguishes sentiment classification from topical text classification: bigram features are underappreciated in sentiment classification, while their effect on topical classification is more mixed. It also shows that NB models generally outperform more sophisticated, structure-sensitive models on snippet sentiment tasks. Combining generative and discriminative classifiers, the authors build an SVM over NB log-count ratios as feature values, which performs well across tasks.

The evaluation covers several datasets, including short movie reviews, customer reviews, opinion polarity, subjectivity, and full-length movie reviews. Multinomial NB (MNB) performs better than SVMs on snippet tasks, while SVMs perform better on full-length reviews. Bigram features improve performance on all tasks, especially sentiment classification, where they capture modified verbs and nouns. NBSVM performs well on both snippets and longer documents, often beating previously published results. The paper concludes that NB and SVM variants, particularly NBSVM, are strong baselines for text classification. It also finds that binarized MNB performs better than standard MNB, and that MNB is more stable than multivariate Bernoulli NB.
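As an illustration of the binarized unigram-plus-bigram presence features discussed above, here is a minimal sketch; the function name and the whitespace tokenizer are our own simplifications, not taken from the paper's released code:

```python
def binarized_ngram_features(text, n_max=2):
    """Return the set of unigram and bigram presence features for a snippet.

    Binarized (presence, not count) features correspond to the binarized
    MNB setting the paper reports as stronger than count-based MNB.
    Tokenization here is simplified to lowercased whitespace splitting.
    """
    tokens = text.lower().split()
    feats = set(tokens)  # unigrams
    if n_max >= 2:
        feats |= {" ".join(pair) for pair in zip(tokens, tokens[1:])}
    return feats

print(sorted(binarized_ngram_features("not very good")))
# → ['good', 'not', 'not very', 'very', 'very good']
```

Note how the bigrams "not very" and "very good" capture the kind of modified expressions that unigrams alone miss, which is the paper's explanation for why bigrams help sentiment tasks in particular.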
The code and datasets to reproduce the results are publicly available.
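The NBSVM idea can be sketched in a few lines: compute the NB log-count ratio r from smoothed per-class feature-count vectors, then train a linear SVM on each document's (binarized) feature vector scaled elementwise by r. Below is a toy sketch of the ratio computation only; the variable names are ours, the counts are invented for illustration, and the SVM step is just noted in a comment:

```python
import numpy as np

def log_count_ratio(f_pos, f_neg, alpha=1.0):
    """NB log-count ratio r = log((p / |p|_1) / (q / |q|_1)).

    f_pos / f_neg are feature-count vectors summed over the positive /
    negative training documents; alpha is the smoothing parameter
    (the paper uses alpha = 1).
    """
    p = alpha + f_pos
    q = alpha + f_neg
    return np.log((p / p.sum()) / (q / q.sum()))

# Toy counts for three features, e.g. "great", "awful", "movie".
f_pos = np.array([3.0, 0.0, 1.0])
f_neg = np.array([0.0, 4.0, 1.0])
r = log_count_ratio(f_pos, f_neg)
# Positive-leaning features get r > 0, negative-leaning ones r < 0.
# NBSVM then fits a linear SVM on x_hat = r * f_binarized for each
# document, whereas MNB classifies directly with sign(r . f + b).
print(r)
```

The elementwise scaling lets the SVM start from NB's per-feature evidence, which is why the combined model inherits NB's strength on snippets while keeping the SVM's advantage on longer documents.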