BLOOM: A 176B-Parameter Open-Access Multilingual Language Model


27 Jun 2023 | Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Thomas Wolf, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel
BLOOM is a 176B-parameter open-access multilingual language model developed through a collaboration of hundreds of researchers. It is a decoder-only Transformer trained on the ROOTS corpus, a dataset comprising text in 46 natural languages and 13 programming languages. BLOOM achieves competitive performance on a wide range of benchmarks, with stronger results after multitask prompted finetuning. The model is publicly released under the Responsible AI License to facilitate future research and applications using large language models.

The development of BLOOM was supported by a French public grant from GENCI and IDRIS and carried out on IDRIS's Jean Zay supercomputer. The project involved a comprehensive design process for each of its components, including the training dataset, the model architecture, and the engineering strategy for distributed training; the model's capabilities were analyzed and the coordinated development process documented.

BLOOM was trained on a diverse range of text sources, including a composite collection of 498 Hugging Face datasets, with preprocessing steps to ensure data quality and privacy. The architecture is a causal decoder-only Transformer, chosen for its effectiveness at zero-shot generalization. The tokenizer was carefully designed so that sentences in any of the training languages are encoded losslessly (a roundtrip sketch follows below). BLOOM was also finetuned on multitask prompted datasets to enhance its multilingual zero-shot task generalization. The model's development was guided by an Ethical Charter emphasizing inclusivity, diversity, openness, and responsibility.
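As an illustration of the lossless-encoding property, the roundtrip below uses the publicly released BLOOM tokenizer (a byte-level BPE) from the Hugging Face Hub. This is a minimal sketch, assuming the `transformers` library is installed and the `bigscience/bloom` tokenizer files can be downloaded; the sample strings are illustrative and not taken from the paper.

```python
from transformers import AutoTokenizer

# Load the publicly released BLOOM tokenizer (byte-level BPE trained on ROOTS).
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")

# Texts in several of BLOOM's training languages, plus a code snippet
# (illustrative examples, not drawn from the paper).
samples = [
    "The quick brown fox jumps over the lazy dog.",   # English
    "Les modèles de langue sont très utiles.",        # French
    "El aprendizaje automático avanza rápidamente.",  # Spanish
    "def add(a, b):\n    return a + b",               # Python code
]

for text in samples:
    ids = tokenizer.encode(text, add_special_tokens=False)
    # Lossless encoding: decoding the token ids recovers the exact input string.
    assert tokenizer.decode(ids) == text
    print(f"{len(ids):3d} tokens: {text!r}")
```

Because the tokenizer operates at the byte level, no input character can fall outside the vocabulary, which is what makes the exact roundtrip possible across all of the training languages.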
The project also aimed to address the social limitations of large language models by involving a diverse group of researchers and by ensuring data governance and ethical considerations. BLOOM thus represents a significant step toward democratizing large language models and promoting inclusive, collaborative, and reliable governance of the technology. Because the weights are openly released, the model can be loaded and queried with standard tooling, as in the sketch below.
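The following sketch loads `bigscience/bloom-560m`, one of the smaller publicly released checkpoints in the BLOOM family, since the full 176B model (`bigscience/bloom`) requires multi-GPU hardware. The prompt and generation settings are illustrative assumptions, not taken from the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# A small checkpoint from the BLOOM family; the full model is "bigscience/bloom".
model_name = "bigscience/bloom-560m"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Zero-shot prompting: the causal decoder-only model simply continues the prompt.
prompt = "Translate to French: 'The weather is nice today.' ->"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=30,   # illustrative setting
    do_sample=False,     # greedy decoding for reproducibility
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```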