Enriching Word Vectors with Subword Information


19 Jun 2017 | Piotr Bojanowski*, Edouard Grave*, Armand Joulin, Tomas Mikolov
This paper proposes a method to enrich word vectors with subword information, improving representations for morphologically rich languages. The approach extends the skipgram model: each word is represented as a bag of character n-grams, each n-gram is assigned a vector, and a word's representation is the sum of the vectors of its n-grams. This keeps training efficient on large corpora and makes it possible to compute representations for words that do not appear in the training data, simply by summing their character n-gram vectors.

The model is evaluated on nine languages and achieves state-of-the-art performance on word similarity and analogy tasks compared to existing morphological word representations, while remaining fast, scalable, and effective for rare words. It also performs well on language modeling, with significant improvements for morphologically rich languages such as Czech and Russian. Compared to other methods, including morphological word representations and character-aware models, it obtains superior results on several tasks. The model is implemented in C++ and is publicly available. The paper concludes that incorporating subword information into word vectors is a promising approach for improving word representations in natural language processing.
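To make the word-building step concrete, here is a minimal sketch of how a word can be decomposed into boundary-marked character n-grams and represented as the sum of their vectors. The boundary symbols `<` and `>` and the 3-to-6 n-gram range follow the paper's description; the helper names (`char_ngrams`, `ngram_vector`, `word_vector`) and the dictionary of random vectors are illustrative stand-ins, since the actual fastText implementation hashes n-grams into a fixed number of buckets and learns their vectors during skipgram training.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, wrapped in boundary markers.

    For example, "where" with n=3 yields <wh, whe, her, ere, re>,
    plus the special sequence <where> for the whole word.
    """
    wrapped = f"<{word}>"
    grams = [wrapped[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(wrapped) - n + 1)]
    grams.append(wrapped)  # the full word is kept as one extra unit
    return grams

# Illustrative n-gram vector table: random vectors stand in for the
# embeddings that would be learned with the skipgram objective.
dim = 4
rng = np.random.default_rng(0)
ngram_vectors = {}

def ngram_vector(gram):
    # Lazily assign a vector to each n-gram (stand-in for trained embeddings).
    if gram not in ngram_vectors:
        ngram_vectors[gram] = rng.normal(size=dim)
    return ngram_vectors[gram]

def word_vector(word):
    """Word representation = sum of its character n-gram vectors.

    Because it depends only on n-grams, it also yields a vector for
    words never seen during training (out-of-vocabulary words).
    """
    return sum(ngram_vector(g) for g in char_ngrams(word))

print(char_ngrams("where", 3, 3))
print(word_vector("unseenword"))
```

The same summation is what gives the model its out-of-vocabulary behavior: a new word shares n-grams with words seen in training, so its vector lands near morphologically related words.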