Honolulu, Hawaii, USA. PMLR 202, 2023

Stella Biderman *1,2, Hailey Schoelkopf *1,3, Quentin Anthony 1, Herbie Bradley 1,4, Kyle O'Brien 1, Eric Hallahan 1, Mohammad Aflah Khan 5, Shivanshu Purohit 6,1, USVSN Sai Prashanth 1, Edward Raff 2, Aviya Skowron 1, Lintang Sutawika 1,7, Oskar van der Wal 8
Pythia is a suite of 16 large language models (LLMs), ranging from 70M to 12B parameters, all trained on the same public data in the same order. The suite provides 154 checkpoints per model, along with tools to download and reconstruct their exact training data for further study. The models are designed to facilitate research on LLMs, with case studies exploring memorization, gender bias, and the effects of term frequency. Pythia enables controlled experiments on how training data and model scale influence LLM behavior. The suite includes models trained on the Pile dataset and on a deduplicated version of it, allowing direct comparisons. All models are trained with a consistent architecture and hyperparameters, with a focus on reproducibility and public access. Pythia's public release includes training code, checkpoints, and evaluation tools, enabling researchers to study LLMs across tasks and scales. The suite addresses gaps in existing research by providing a controlled environment for studying LLM training dynamics and biases. Key findings include the impact of pretraining data on gender bias, the role of term frequency in task performance, and the emergence of memorization patterns. Pythia's structured approach supports further research into LLM capabilities and limitations.
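As an illustration of how the released checkpoints can be used, the sketch below loads one intermediate checkpoint of a Pythia model with the Hugging Face transformers library. It assumes the public EleutherAI/pythia-* repositories, where intermediate checkpoints are exposed as git revisions named step{N}; the specific model size and step number are chosen only for illustration.

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Load an intermediate training checkpoint of Pythia-70M.
# Checkpoints are published as git revisions named "step{N}";
# "step3000" is used here purely as an example.
model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m",
    revision="step3000",
)
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-70m",
    revision="step3000",
)

# Generate a short continuation to confirm the checkpoint loaded.
inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(tokens[0]))
```

Because every model size exposes the same set of checkpoint revisions, the same snippet can be pointed at a different size or training step to compare behavior across scale and training time.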