27 Jun 2023 | Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Thomas Wolf, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel
BLOOM is a 176 billion parameter open-access multilingual language model developed by a collaboration of hundreds of researchers through the BigScience initiative. It was trained on the ROOTS corpus, which includes 46 natural and 13 programming languages. BLOOM demonstrates competitive performance on various benchmarks, with enhanced results after multitask prompted finetuning. The model's architecture, training dataset, and engineering strategy are detailed, emphasizing the importance of distributed learning and ethical considerations. The paper also discusses the challenges and solutions in data curation, tokenization, and model architecture, aiming to democratize access to large-scale language models and promote reproducibility and interpretability.