7 Jun 2024 | Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, Hannaneh Hajishirzi
The paper introduces OLMo, a truly open language model, along with the framework used to build and study it. Unlike previous efforts that release only model weights and inference code, OLMo ships with its training data, training and evaluation code, intermediate model checkpoints, and training logs. The authors' goal is to enable scientific research on language models, including their biases and potential risks. OLMo is trained on Dolma, a diverse, multi-source corpus containing trillions of tokens across billions of documents. The paper details OLMo's architecture, training setup, and evaluation methods, and compares its performance against other large language models on a range of tasks; the results show that OLMo is competitive on both downstream tasks and intrinsic language modeling evaluations. The authors also discuss the limitations of their work, such as the challenges of training large language models and the potential risks associated with AI systems. They emphasize the importance of openness in advancing the field of language models and plan to continuously support and extend OLMo.
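To make the openness claim concrete, below is a minimal sketch of how one of the released checkpoints might be loaded and queried through the Hugging Face Hub. The model identifier, the illustrative revision string for an intermediate checkpoint, and the need for trust_remote_code are assumptions not taken from the paper; consult the official OLMo release for the artifacts actually published.

# Minimal sketch (not from the paper) of loading a released OLMo checkpoint.
# The Hub model ID and revision name below are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B"  # assumed Hub identifier for the released model

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Intermediate training checkpoints may be exposed as Hub revisions; the
# revision string here is illustrative, not verified against the release.
# model = AutoModelForCausalLM.from_pretrained(
#     model_id, revision="step1000-tokens4B", trust_remote_code=True
# )

# Generate a short continuation to confirm the checkpoint loaded correctly.
inputs = tokenizer("Language modeling is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The same pattern would apply to the training data: the Dolma corpus is distributed separately from the model weights, so loading it goes through a dataset interface rather than the snippet above.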