This paper introduces Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations for variable-length texts such as sentences, paragraphs, and documents. Unlike traditional bag-of-words models, which discard word order and ignore the semantics of individual words, Paragraph Vector captures word semantics and is sensitive to word order within a small context. The algorithm represents each document as a dense vector trained to predict words within the document. It can handle texts of any length and does not require task-specific tuning or parsing.
Paragraph Vector is inspired by neural network-based word vector learning methods, in which word vectors are trained to predict the next word in a sentence. It extends this idea by training a paragraph vector that, together with the context word vectors, predicts the next word. The paragraph vector is shared across all contexts from the same paragraph but not across paragraphs, while word vectors are shared across all paragraphs. At prediction time, the vector for an unseen paragraph is inferred by holding the learned word vectors and softmax weights fixed and training only the new paragraph vector by gradient descent until convergence.
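To make the mechanics concrete, here is a minimal numpy sketch of a single PV-DM step under toy assumptions (tiny vocabulary, full softmax, averaged inputs). The array names D, W, U and all hyperparameters are illustrative, not the paper's notation. Setting train=False mirrors the inference procedure described above: only the paragraph vector is updated while word vectors and softmax weights stay fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
V, dim = len(vocab), 16                 # vocabulary size, vector size

D = rng.normal(0, 0.1, (1, dim))        # one paragraph vector (the "memory")
W = rng.normal(0, 0.1, (V, dim))        # word vectors, shared across paragraphs
U = rng.normal(0, 0.1, (V, dim))        # softmax output weights
b = np.zeros(V)                         # softmax bias

def step(para_vec, context_ids, target_id, lr=0.025, train=True):
    """One PV-DM gradient step: average the paragraph vector with the context
    word vectors, predict the target word with a softmax classifier, and
    backpropagate. With train=False, only the paragraph vector is updated,
    which is how a vector for an unseen paragraph is inferred."""
    n = 1 + len(context_ids)
    h = (para_vec + W[context_ids].sum(axis=0)) / n
    logits = U @ h + b
    p = np.exp(logits - logits.max())
    p /= p.sum()
    p[target_id] -= 1.0                 # dLoss/dlogits = softmax - one_hot
    grad_h = U.T @ p
    if train:
        U[:] -= lr * np.outer(p, h)
        b[:] -= lr * p
        W[context_ids] -= lr * grad_h / n
    return para_vec - lr * grad_h / n

# Training: paragraph vector, word vectors and softmax weights all move.
ids = [vocab.index(w) for w in ["the", "cat", "sat", "on", "mat"]]
for _ in range(100):
    D[0] = step(D[0], ids[:3], ids[3])

# Inference on a new paragraph: freeze shared parameters, fit only the new vector.
new_vec = rng.normal(0, 0.1, dim)
for _ in range(100):
    new_vec = step(new_vec, ids[:3], ids[3], train=False)
```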
The algorithm is tested on several text classification and sentiment analysis tasks, achieving new state-of-the-art results. On sentiment analysis tasks, it outperforms bag-of-words models and other techniques, achieving a relative improvement of over 16% in error rate. On text classification tasks, it convincingly beats bag-of-words models, achieving a relative improvement of about 30%.
The paper also presents two variants of Paragraph Vector: PV-DM (Distributed Memory) and PV-DBOW (Distributed Bag of Words). PV-DM treats the paragraph vector as a memory that captures the paragraph's topic and combines it with context word vectors to predict the next word, while PV-DBOW ignores the context words and trains the paragraph vector to predict words randomly sampled from the paragraph. The combination of PV-DM and PV-DBOW often performs better than either alone.
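As an illustration, the sketch below uses the gensim library's Doc2Vec class, an independent reimplementation of Paragraph Vector rather than the authors' code, to train a PV-DM model (dm=1) and a PV-DBOW model (dm=0) and concatenate their inferred vectors for a new paragraph. The corpus and hyperparameters are placeholders.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each paragraph is a TaggedDocument with a unique tag.
corpus = [
    TaggedDocument(words="the movie was wonderful".split(), tags=[0]),
    TaggedDocument(words="the plot was dull and slow".split(), tags=[1]),
]

# PV-DM: paragraph vector plus context words predict the next word.
pv_dm = Doc2Vec(corpus, dm=1, vector_size=100, window=5, min_count=1, epochs=40)
# PV-DBOW: the paragraph vector alone predicts words sampled from the paragraph.
pv_dbow = Doc2Vec(corpus, dm=0, vector_size=100, min_count=1, epochs=40)

def paragraph_features(tokens):
    """Infer a vector for a new paragraph from each model and concatenate them."""
    return np.concatenate([pv_dm.infer_vector(tokens),
                           pv_dbow.infer_vector(tokens)])

features = paragraph_features("a surprisingly moving film".split())
print(features.shape)   # (200,) -> input features for a downstream classifier
```

The concatenated vector can then be fed to any off-the-shelf classifier (e.g., logistic regression), which is how the paper evaluates the representations on labeled tasks.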
Experiments on the Stanford Sentiment Treebank and IMDB datasets show that Paragraph Vector outperforms other methods, including recursive neural networks and bag-of-words models. On an information retrieval task, Paragraph Vector also performs well, achieving a 32% relative improvement in error rate compared to other methods.
The paper concludes that Paragraph Vector is a powerful method for capturing the semantics of text and is particularly useful for tasks with limited labeled data. It is an unsupervised method that can be applied to a wide range of text and sequential data tasks.