28 Mar 2018 | Edouard Grave¹*, Piotr Bojanowski¹*, Prakhar Gupta¹,², Armand Joulin¹, Tomas Mikolov¹
This paper presents high-quality word vectors trained for 157 languages on data from Wikipedia and the Common Crawl project. The vectors are trained with an extension of the fastText model that incorporates subword information. The authors introduce three new word analogy datasets, for French, Hindi, and Polish, and evaluate the pre-trained vectors on word analogy tasks across 10 languages. The results show significant improvements over previously published models, particularly for languages with smaller Wikipedia corpora. The paper also analyzes how hyperparameter choices and the use of noisy web data affect the quality of the resulting vectors.
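To make the word-analogy evaluation concrete, here is a minimal sketch of an offset-based analogy query (a : b :: c : ?) against one of the released models. It assumes the `fasttext` Python package and the published `cc.fr.300.bin` French model file; the helper function and example words are illustrative, not taken from the paper.

```python
# Sketch: solving a word analogy a : b :: c : ? by vector offset,
# the standard evaluation protocol used for the pre-trained vectors.
# Assumes the `fasttext` Python package and the released cc.fr.300.bin file.
import numpy as np
import fasttext

model = fasttext.load_model("cc.fr.300.bin")  # pre-trained French vectors

def analogy(a, b, c, k=5):
    """Return the k words closest to b - a + c by cosine similarity."""
    query = (model.get_word_vector(b)
             - model.get_word_vector(a)
             + model.get_word_vector(c))
    query /= np.linalg.norm(query)
    scored = []
    for w in model.get_words():
        if w in (a, b, c):
            continue  # exclude the query words themselves, as is conventional
        v = model.get_word_vector(w)
        scored.append((float(np.dot(query, v) / (np.linalg.norm(v) + 1e-9)), w))
    return sorted(scored, reverse=True)[:k]

# Example query; we would expect "allemagne" to rank near the top.
print(analogy("paris", "france", "berlin"))
```

The brute-force scan over the full vocabulary is slow but keeps the sketch self-contained; in practice one would precompute a normalized embedding matrix and use a single matrix-vector product per query.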