Learning Word Vectors for 157 Languages

28 Mar 2018 | Edouard Grave*, Piotr Bojanowski*, Prakhar Gupta, Armand Joulin, Tomas Mikolov
This paper presents the training of high-quality word vectors for 157 languages using Wikipedia and Common Crawl data. The authors introduce three new word analogy datasets, for French, Hindi, and Polish, to evaluate their models, and they evaluate the pre-trained word vectors on 10 languages, showing strong performance compared to previous models.

The word vectors are trained with a modified version of the fastText model that uses subword information. The training data combines Wikipedia with Common Crawl, the latter providing a much larger amount of text. To prepare the data, the authors perform language identification and deduplication, and tokenize the text of each language with an appropriate tool.

The models are evaluated on word analogy tasks, where the goal is to find the word that completes a relationship between three given words (for example, Paris is to France as Berlin is to Germany). Besides introducing the three new analogy datasets for French, Hindi, and Polish, the authors compare several model variants, such as the fastText skipgram model with default parameters and models with position-dependent weighting of context words.

The results show that adding Common Crawl data significantly improves performance for languages with a small Wikipedia, such as Finnish and Hindi, while for high-resource languages like German, Spanish, and French the gains are small. The authors also find that using more negative examples and more training epochs improves accuracy.
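The data-preparation step described above can be illustrated with a short sketch. The snippet below is a hypothetical example, not the authors' released pipeline: it keeps only lines of a target language and drops exact duplicates. It assumes fastText's publicly distributed lid.176.bin language-identification model (standing in for the paper's own fastText-based language detector) and hypothetical input and output file names.

```python
import hashlib
import fasttext  # pip install fasttext; lid.176.bin is downloadable from fasttext.cc

# Hypothetical file names; the actual Common Crawl pipeline is not part of this summary.
lid_model = fasttext.load_model("lid.176.bin")
target_label = "__label__hi"  # e.g. keep only lines identified as Hindi
seen_hashes = set()

with open("commoncrawl_raw.txt", encoding="utf-8") as src, \
     open("hi.dedup.txt", "w", encoding="utf-8") as dst:
    for line in src:
        line = line.strip()
        if not line:
            continue
        # Language identification: keep the line only if the top predicted label matches.
        labels, _probs = lid_model.predict(line)
        if labels[0] != target_label:
            continue
        # Deduplication: drop lines whose hash has already been seen.
        digest = hashlib.sha1(line.encode("utf-8")).digest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        dst.write(line + "\n")
```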
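Training subword-aware vectors can be sketched with the open-source fastText Python bindings. This is a minimal sketch, not the authors' released configuration: the skipgram model and character n-grams of length 3 to 6 are the library defaults, while 300 dimensions, 10 negative samples, and 10 epochs reflect the summary's observation that more negatives and more epochs improve accuracy; treat the exact values as assumptions.

```python
import fasttext

# Minimal sketch: train subword-aware word vectors on a preprocessed
# (language-identified, deduplicated, tokenized) corpus file.
# "hi.dedup.txt" is the hypothetical output of the previous step.
model = fasttext.train_unsupervised(
    "hi.dedup.txt",
    model="skipgram",   # the summary also mentions variants with position weights
    dim=300,            # vector dimension
    minn=3, maxn=6,     # character n-gram lengths that provide subword information
    neg=10,             # more negative samples than the library default of 5
    epoch=10,           # more passes over the data than the library default of 5
)

model.save_model("vectors.hi.bin")
print(model.get_word_vector("नमस्ते").shape)  # (300,)
```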
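The word analogy evaluation can be made concrete with the standard vector-offset method: given a pair (a, b) and a query word c, the predicted answer is the vocabulary word whose vector is closest to vec(b) - vec(a) + vec(c), excluding the three query words. The sketch below assumes an in-memory dictionary mapping words to unit-normalized numpy vectors; it illustrates the protocol and is not the authors' evaluation script.

```python
import numpy as np

def solve_analogy(a, b, c, vectors):
    """Return the word d that best completes the analogy a : b :: c : d.

    `vectors` is assumed to map words to unit-normalized numpy arrays,
    for example loaded from released .vec text files.
    """
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):  # standard protocol: exclude the query words
            continue
        score = float(np.dot(vec, target))  # cosine similarity, since vectors are unit-normalized
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Toy usage with random vectors standing in for real embeddings; with real
# embeddings the expected answer would be "germany", with random ones it is arbitrary.
rng = np.random.default_rng(0)
vectors = {}
for w in ["paris", "france", "berlin", "germany", "rome"]:
    v = rng.normal(size=300)
    vectors[w] = v / np.linalg.norm(v)
print(solve_analogy("paris", "france", "berlin", vectors))
```

Accuracy on an analogy dataset is then the fraction of questions for which the predicted word matches the reference answer.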
The paper concludes that training word vectors on large-scale noisy data from the web can lead to better coverage and performance for languages with small Wikipedia data. The authors also note that for low-resource languages, the quality of the obtained word vectors is much lower than for other languages. Future work includes exploring more techniques to improve the quality of models for such languages.