This paper introduces Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations for variable-length texts such as sentences, paragraphs, and documents. Unlike traditional bag-of-words models, which discard word order and ignore the semantics of individual words, Paragraph Vector captures word semantics and is sensitive to word order within a small context. The algorithm represents each document as a dense vector trained to predict words within the document. It can handle texts of any length and does not require task-specific tuning or parsing.
Paragraph Vector is inspired by neural network-based word vector learning methods, in which word vectors are trained to predict the next word in a sentence. It extends this idea by training a paragraph vector that, together with the context word vectors, predicts the next word. The paragraph vector is shared across all contexts from the same paragraph but not across paragraphs, while word vectors are shared across all paragraphs. At prediction time, the vector for an unseen paragraph is inferred by holding the learned word vectors and softmax weights fixed and training only the new paragraph vector by gradient descent until convergence.
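To make the mechanics concrete, here is a minimal numpy sketch of a single PV-DM step under toy assumptions (tiny vocabulary, full softmax, averaged inputs). The array names D, W, U and all hyperparameters are illustrative, not the paper's notation. Setting train=False mirrors the inference procedure described above: only the paragraph vector is updated while word vectors and softmax weights stay fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
V, dim = len(vocab), 16                 # vocabulary size, vector size

D = rng.normal(0, 0.1, (1, dim))        # one paragraph vector (the "memory")
W = rng.normal(0, 0.1, (V, dim))        # word vectors, shared across paragraphs
U = rng.normal(0, 0.1, (V, dim))        # softmax output weights
b = np.zeros(V)                         # softmax bias

def step(para_vec, context_ids, target_id, lr=0.025, train=True):
    """One PV-DM gradient step: average the paragraph vector with the context
    word vectors, predict the target word with a softmax classifier, and
    backpropagate. With train=False, only the paragraph vector is updated,
    which is how a vector for an unseen paragraph is inferred."""
    n = 1 + len(context_ids)
    h = (para_vec + W[context_ids].sum(axis=0)) / n
    logits = U @ h + b
    p = np.exp(logits - logits.max())
    p /= p.sum()
    p[target_id] -= 1.0                 # dLoss/dlogits = softmax - one_hot
    grad_h = U.T @ p
    if train:
        U[:] -= lr * np.outer(p, h)
        b[:] -= lr * p
        W[context_ids] -= lr * grad_h / n
    return para_vec - lr * grad_h / n

# Training: paragraph vector, word vectors and softmax weights all move.
ids = [vocab.index(w) for w in ["the", "cat", "sat", "on", "mat"]]
for _ in range(100):
    D[0] = step(D[0], ids[:3], ids[3])

# Inference on a new paragraph: freeze shared parameters, fit only the new vector.
new_vec = rng.normal(0, 0.1, dim)
for _ in range(100):
    new_vec = step(new_vec, ids[:3], ids[3], train=False)
```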
The algorithm is tested on several text classification and sentiment analysis tasks, achieving new state-of-the-art results. On sentiment analysis tasks, it outperforms bag-of-words models and other techniques, achieving a relative improvement of over 16% in error rate. On text classification tasks, it convincingly beats bag-of-words models, achieving a relative improvement of about 30%.
The paper also presents two variants of Paragraph Vector: PV-DM (Distributed Memory) and PV-DBOW (Distributed Bag of Words). PV-DM treats the paragraph vector as a memory that captures the paragraph's topic and combines it with context word vectors to predict the next word, while PV-DBOW ignores the context words and trains the paragraph vector to predict words randomly sampled from the paragraph. The combination of PV-DM and PV-DBOW often performs better than either alone.
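As an illustration, the sketch below uses the gensim library's Doc2Vec class, an independent reimplementation of Paragraph Vector rather than the authors' code, to train a PV-DM model (dm=1) and a PV-DBOW model (dm=0) and concatenate their inferred vectors for a new paragraph. The corpus and hyperparameters are placeholders.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each paragraph is a TaggedDocument with a unique tag.
corpus = [
    TaggedDocument(words="the movie was wonderful".split(), tags=[0]),
    TaggedDocument(words="the plot was dull and slow".split(), tags=[1]),
]

# PV-DM: paragraph vector plus context words predict the next word.
pv_dm = Doc2Vec(corpus, dm=1, vector_size=100, window=5, min_count=1, epochs=40)
# PV-DBOW: the paragraph vector alone predicts words sampled from the paragraph.
pv_dbow = Doc2Vec(corpus, dm=0, vector_size=100, min_count=1, epochs=40)

def paragraph_features(tokens):
    """Infer a vector for a new paragraph from each model and concatenate them."""
    return np.concatenate([pv_dm.infer_vector(tokens),
                           pv_dbow.infer_vector(tokens)])

features = paragraph_features("a surprisingly moving film".split())
print(features.shape)   # (200,) -> input features for a downstream classifier
```

The concatenated vector can then be fed to any off-the-shelf classifier (e.g., logistic regression), which is how the paper evaluates the representations on labeled tasks.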
Experiments on the Stanford Sentiment Treebank and IMDB datasets show that Paragraph Vector outperforms other methods, including recursive neural networks and bag-of-words models. On an information retrieval task, Paragraph Vector also performs well, achieving a 32% relative improvement in error rate compared to other methods.
The paper concludes that Paragraph Vector is a powerful method for capturing the semantics of text and is particularly useful for tasks with limited labeled data. It is an unsupervised method that can be applied to a wide range of text and sequential data tasks.