9 Apr 2025 | Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, António Loison, Duarte M. Alves, Caio Corro, Nicolas Boizard, João Alves, Pedro H. Martins, Antoni Bigata Casademunt, François Yvon, André F.T. Martins, Gautier Viaud, Céline Hudelot, Pierre Colombo
CroissantLLM is a 1.3B-parameter bilingual French-English language model pre-trained on 3T tokens of English and French data. It is designed as a high-performance, fully open-sourced model that runs efficiently on consumer-grade hardware, enabling low-latency, energy-efficient inference on low-resource devices. Training uses a 1:1 English-to-French data ratio, a custom tokenizer, and bilingual fine-tuning datasets, with the explicit goal of reducing English bias and improving multilingual understanding. The French split is built from high-quality, manually curated sources spanning internet data, legal documents, scientific articles, and cultural content, balanced against English and code data.
A novel benchmark, FrenchBench, is introduced to evaluate model performance in French across a range of language understanding and generation tasks. Evaluated on both English and French benchmarks, the model shows strong performance in both languages, and fine-tuned chat and translation variants demonstrate its versatility. Assessed with the FMTI framework, the release satisfies 81% of the transparency criteria; codebases, checkpoints, and fine-tuned chat models are all published, enabling research and industrial use.
The model is trained at a roughly 2300:1 token-to-parameter ratio, which yields strong performance for its size category and makes it inference-optimized for industrial and research applications across a wide range of devices. Training was carried out on a supercomputer powered by low-carbon nuclear electricity, with a focus on reducing environmental impact.
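The 2300:1 figure follows directly from the stated training budget of 3T tokens for a 1.3B-parameter model, far beyond the roughly 20:1 allocation suggested by Chinchilla-style compute-optimal scaling, a trade that favors inference efficiency over training cost. A quick sanity check of the arithmetic:

```python
# Sanity check of the token-to-parameter ratio cited above.
tokens = 3e12   # 3T pre-training tokens
params = 1.3e9  # 1.3B model parameters

ratio = tokens / params
print(round(ratio))  # ~2308, i.e. roughly the 2300:1 ratio reported
```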
The model is released with a commitment to transparency, enabling both research and industrial adoption, and serves as a proven platform for researching LLMs and kickstarting future pretraining efforts.