This paper presents a series of experiments with convolutional neural networks (CNNs) trained on top of pre-trained word vectors for sentence-level classification tasks. The authors show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. They also propose a simple modification to the architecture to allow for the use of both task-specific and static vectors. The CNN models discussed improve upon the state of the art on 4 out of 7 tasks, including sentiment analysis and question classification.
The model architecture is a slight variant of the CNN architecture of Collobert et al. (2011). Word vectors are obtained from an unsupervised neural language model: the publicly available word2vec vectors trained by Mikolov et al. (2013) on 100 billion words of Google News. The authors initially keep the word vectors static and learn only the other parameters of the model. Despite little hyperparameter tuning, this simple model achieves excellent results on multiple benchmarks, suggesting that the pre-trained vectors are 'universal' feature extractors that can be utilized for various classification tasks. Learning task-specific vectors through fine-tuning yields further improvements. The model applies multiple convolutional filters (with varying window sizes) over the sentence, and max-over-time pooling reduces each filter's feature map to a single feature. These features form the penultimate layer and are passed to a fully connected softmax layer whose output is the probability distribution over labels.
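To make the architecture concrete, here is a minimal sketch of the single-channel model in PyTorch. This is not the authors' original implementation; the class name, vocabulary size, embedding dimension, filter widths, and feature counts are illustrative assumptions rather than values fixed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=300, num_classes=2,
                 filter_widths=(3, 4, 5), num_filters=100, dropout=0.5,
                 pretrained=None, freeze_embeddings=True):
        super().__init__()
        # Embedding layer; initialized from pre-trained word vectors when
        # `pretrained` is given, and kept static if frozen.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        if pretrained is not None:
            self.embed.weight.data.copy_(pretrained)
        self.embed.weight.requires_grad = not freeze_embeddings
        # One 1-D convolution per filter width; each produces `num_filters`
        # feature maps over the sentence.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, kernel_size=w) for w in filter_widths
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(filter_widths), num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> (batch, embed_dim, seq_len)
        x = self.embed(token_ids).transpose(1, 2)
        # Convolve, apply the non-linearity, then max-over-time pooling:
        # each filter contributes a single feature to the penultimate layer.
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        z = self.dropout(torch.cat(feats, dim=1))
        return self.fc(z)  # logits; softmax over labels is applied in the loss
```

In this sketch, a pre-trained embedding matrix would be passed as `pretrained`; the static variant corresponds to `freeze_embeddings=True`, and the fine-tuned (non-static) variant to `freeze_embeddings=False`.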
The authors also experiment with having two 'channels' of word vectors: one that is kept static throughout training and one that is fine-tuned via backpropagation. In the multichannel architecture, each filter is applied to both channels and the results are added to calculate $c_i$ in equation (2) of the paper. The model is otherwise equivalent to the single-channel architecture.
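A minimal sketch of this multichannel variant, reusing the hypothetical `SentenceCNN` pieces above: the second embedding copy is fine-tuned while the first stays frozen, and each filter's response is summed across the two channels before pooling. Again, names and hyperparameters are assumptions, not the authors' code.

```python
class MultichannelCNN(SentenceCNN):
    def __init__(self, pretrained, **kwargs):
        vocab_size, embed_dim = pretrained.shape
        super().__init__(vocab_size=vocab_size, embed_dim=embed_dim,
                         pretrained=pretrained, freeze_embeddings=True, **kwargs)
        # Second channel: same initialization, but fine-tuned via backprop.
        self.embed_nonstatic = nn.Embedding.from_pretrained(pretrained, freeze=False)

    def forward(self, token_ids):
        static = self.embed(token_ids).transpose(1, 2)
        tuned = self.embed_nonstatic(token_ids).transpose(1, 2)
        # Apply each filter to both channels and add the results, so both
        # channels contribute to the same feature c_i; the shared weights
        # are reused and the bias is added only once.
        feats = [
            F.relu(conv(static) + F.conv1d(tuned, conv.weight)).max(dim=2).values
            for conv in self.convs
        ]
        z = self.dropout(torch.cat(feats, dim=1))
        return self.fc(z)
```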
The authors report that their model performs well on various benchmarks, including MR, SST-1, SST-2, Subj, TREC, CR, and MPQA. The results show that the pre-trained vectors are good 'universal' feature extractors that can be utilized across datasets. Fine-tuning the pre-trained vectors for each task gives still further improvements. The authors also discuss the effectiveness of dropout as a regularizer and the impact of different initialization choices on model performance. They conclude that unsupervised pre-training of word vectors is an important ingredient in deep learning for NLP.