21 May 2020 | Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, Benoît Sagot
The paper introduces CamemBERT, a monolingual Transformer-based language model pretrained on French web-crawled data. The authors evaluate CamemBERT on four downstream tasks: part-of-speech tagging, dependency parsing, named entity recognition, and natural language inference. They find that CamemBERT achieves state-of-the-art results on all four tasks, outperforming multilingual models such as mBERT, XLM, and XLM-R. The study also shows that a smaller but diverse training set can reach performance comparable to a much larger, more homogeneous corpus, underlining the effectiveness of web-crawled data for pretraining. In addition, the paper examines how corpus origin and size affect downstream performance, finding that models trained on web-crawled data outperform those trained on Wikipedia-based data. Overall, CamemBERT opens up new possibilities for monolingual contextual pretrained language models for under-resourced languages.
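Since CamemBERT is pretrained with a masked-language-modeling objective, the most direct way to probe it is to ask it to fill in a masked token. The sketch below assumes the released checkpoint is available through the Hugging Face transformers library under the identifier "camembert-base"; that distribution channel is not stated in this summary, so treat the model name and installation step as assumptions.

# Minimal sketch: probing CamemBERT's masked-language-model head.
# Assumes the checkpoint is published as "camembert-base" on the Hugging Face Hub.
# Requires: pip install transformers torch
from transformers import pipeline

# Build a fill-mask pipeline; CamemBERT uses "<mask>" as its mask token.
fill_mask = pipeline("fill-mask", model="camembert-base")

# Print the top predicted tokens and their scores for the masked position.
for prediction in fill_mask("Le camembert est <mask> !"):
    print(prediction["token_str"], round(prediction["score"], 3))

For the downstream tasks discussed above (tagging, parsing, NER, NLI), the same pretrained encoder would instead be loaded with a task-specific head and fine-tuned on labeled data, rather than queried through the fill-mask interface.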