CamemBERT: a Tasty French Language Model

21 May 2020 | Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, Benoît Sagot
CamemBERT is a monolingual French language model based on the RoBERTa architecture, trained on the French portion of OSCAR, a large-scale, pre-filtered corpus derived from Common Crawl. The model achieves state-of-the-art results on four downstream tasks: part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER), and natural language inference (NLI), outperforming multilingual models such as mBERT, XLM, and XLM-R.

Performance is evaluated on four French treebanks and the XNLI benchmark. Two findings stand out: a model pretrained on as little as 4GB of diverse data achieves results comparable to models trained on much larger corpora, and pretraining on web-crawled data is more effective than pretraining on Wikipedia data for these tasks.

CamemBERT is released under the MIT license and can be used for a wide range of NLP tasks. Its success demonstrates the effectiveness of large pretrained language models for French and highlights the potential of monolingual models for under-resourced languages.
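Since the model is freely available, it can be queried directly for its pretraining objective, masked-language modeling. The following is a minimal sketch, assuming the Hugging Face transformers library and its publicly hosted camembert-base checkpoint; the pipeline name, the `<mask>` token, and the result fields follow that library's conventions, not anything specified in this summary.

```python
# Minimal sketch: querying CamemBERT's masked-language-modeling head.
# Assumes the Hugging Face "transformers" package and the public
# "camembert-base" checkpoint are available.
from transformers import pipeline

# Load the pretrained fill-mask pipeline for CamemBERT.
fill_mask = pipeline("fill-mask", model="camembert-base")

# CamemBERT uses "<mask>" as its mask token.
for prediction in fill_mask("Le camembert est un fromage <mask>."):
    # Each prediction carries the proposed token and its probability.
    print(f"{prediction['token_str']:>12}  {prediction['score']:.3f}")
```

The same checkpoint can be fine-tuned for the downstream tasks listed above (POS tagging, parsing, NER, NLI) using the standard token- and sequence-classification heads that the library provides.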