CroissantLLM: A Truly Bilingual French-English Language Model

9 Apr 2025 | Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, António Loison, Duarte M. Alves, Caio Corro, Nicolas Boizard, João Alves, Ricardo Rei, Pedro H. Martins, Antoni Bigata Casademunt, François Yvon, André F.T. Martins, Gautier Viaud, Céline Hudelot, Pierre Colombo
CroissantLLM is a 1.3B-parameter language model pre-trained on 3 trillion tokens of English and French text, designed to be a high-performance, fully open-source bilingual model. It is trained with a 1:1 ratio of English to French data, using a custom tokenizer optimized for bilingualism, and is accompanied by a set of fine-tuning datasets.

The training corpus draws on a diverse range of sources, including internet data, literary works, legal documents, and scientific articles, with a focus on curating high-quality, varied French content. The authors introduce FrenchBench, a novel benchmark for evaluating models in French across tasks such as classification, generation, and language understanding. Adhering to transparency principles, they release the training dataset, codebases, checkpoints, and fine-tuned Chat and translation models. Evaluated with the FMTI framework, the model satisfies 81% of the transparency criteria, surpassing most open initiatives.

CroissantLLM aims to address the limitations of current models by providing an inference-optimized, small but capable model that performs well outside English settings. Its openness and transparency are intended to facilitate industrial adoption and research in multilingual language models. The model performs strongly, outperforming existing monolingual and multilingual models on both English and French benchmarks, and demonstrating the benefits of training on a balanced bilingual corpus.
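The balanced 1:1 English-French token budget described above can be pictured as a mixture-weighting step over data sources. The sketch below is an illustrative reconstruction, not the authors' actual data pipeline; the source names and the upsampling strategy are assumptions made for the example.

```python
def bilingual_mixture_weights(en_tokens, fr_tokens):
    """Compute per-source sampling probabilities so that English and French
    each contribute half of the training tokens.

    Illustrative sketch only: the real CroissantLLM pipeline may weight,
    deduplicate, and upsample sources differently.

    en_tokens / fr_tokens: dicts mapping source name -> available token count.
    """
    total_en = sum(en_tokens.values())
    total_fr = sum(fr_tokens.values())
    # Upsample the smaller language so both token budgets match.
    target = max(total_en, total_fr)
    en_factor = target / total_en
    fr_factor = target / total_fr
    weights = {src: n * en_factor for src, n in en_tokens.items()}
    weights.update({src: n * fr_factor for src, n in fr_tokens.items()})
    # Normalize weights into sampling probabilities.
    z = sum(weights.values())
    return {src: w / z for src, w in weights.items()}


# Hypothetical sources: English web data vs. French web and legal text.
probs = bilingual_mixture_weights(
    {"web_en": 300},
    {"web_fr": 100, "legal_fr": 50},
)
```

With these toy counts, the French sources are upsampled 2x, so `web_en` receives probability 0.5 and the two French sources together the remaining 0.5, giving the 1:1 language ratio.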