26 Dec 2017 | Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, Armand Joulin
This paper presents a set of pre-trained word vector representations that significantly outperform existing models on a range of tasks. The authors combine several known but rarely used techniques to improve vector quality: position-dependent context weighting, phrase representations, and subword information. These enhancements are applied to the standard skip-gram and continuous bag-of-words (cbow) models from word2vec and fastText. Results are reported on standard benchmarks, including syntactic and semantic analogies, the rare words dataset, and question answering. The new pre-trained models, released for public use, outperform GloVe models trained on comparable corpora and achieve the best reported accuracy on word analogy tasks by a large margin. The paper also highlights the importance of sentence-level de-duplication and phrase construction when preparing large training corpora.
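
Position-dependent weighting (from Mnih and Kavukcuoglu, 2013) replaces cbow's plain average of context word vectors with an average of element-wise products between each context vector and a learned vector for its position in the window. A minimal numpy sketch of just the forward computation; the names `cbow_context`, `U`, and `D` are illustrative, not from the paper:

```python
import numpy as np

def cbow_context(context_ids, U, D):
    """Position-weighted cbow context representation.

    context_ids : indices of the 2*window surrounding words
    U           : (vocab_size, dim) input word embeddings
    D           : (2*window, dim) learned reweighting vector per position

    Instead of averaging the raw word vectors, each context vector is
    first multiplied element-wise by its position's weight vector.
    """
    vecs = U[context_ids]           # (2*window, dim)
    return (D * vecs).mean(axis=0)  # (dim,)
```

Both `U` and `D` are trained jointly; at the cost of a few extra parameters per window position, the model learns which positions (e.g. the word immediately before the target) carry the most signal.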
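Phrase representations come from the word2phrase heuristic of the original word2vec work: frequent bigrams whose co-occurrence score exceeds a threshold are merged into single tokens (e.g. "New York" becomes "New_York"), and the pass can be iterated to capture longer phrases. A rough sketch of one such pass, assuming sentences are already tokenized; the function and parameter names are illustrative, and the scoring formula follows the word2phrase reference implementation:

```python
from collections import Counter

def merge_phrases(sentences, min_count=5, threshold=100.0):
    """One word2phrase-style pass: join high-scoring bigrams with '_'.

    score(a, b) = (count(a b) - min_count) * total_tokens
                  / (count(a) * count(b))
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    total = sum(unigrams.values())

    phrased = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent):
                a, b = sent[i], sent[i + 1]
                score = (bigrams[(a, b)] - min_count) * total \
                        / (unigrams[a] * unigrams[b])
                if score > threshold:
                    out.append(a + "_" + b)  # merge into one token
                    i += 2
                    continue
            out.append(sent[i])
            i += 1
        phrased.append(out)
    return phrased
```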
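Sentence-level de-duplication matters because web-crawled corpora contain large amounts of verbatim repetition (boilerplate, mirrored pages), which skews co-occurrence counts. The paper does not spell out its exact procedure, so the following is only a plausible hashing-based sketch that keeps the first occurrence of each sentence:

```python
import hashlib

def dedup_sentences(lines):
    """Yield each distinct sentence once, comparing by hash so the
    seen-set stays compact even for corpora with billions of lines.
    Normalizing by strip/lower before hashing is an assumption here."""
    seen = set()
    for line in lines:
        h = hashlib.md5(line.strip().lower().encode("utf-8")).digest()
        if h not in seen:
            seen.add(h)
            yield line
```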