Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling


Honolulu, Hawaii, USA. PMLR 202, 2023 | Stella Biderman*, Hailey Schoelkopf*, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, Oskar van der Wal
Pythia is a suite of 16 large language models (LLMs), ranging in size from 70M to 12B parameters, all trained on public data in the same order. The suite provides 154 checkpoints per model, along with tools to download and reconstruct the exact training dataloader. Pythia is intended to facilitate research in areas such as memorization, the effect of term frequency on few-shot performance, and gender bias reduction.

Key findings include:

1. **Mitigating gender bias**: Modifying the frequency of gendered terms in the pretraining data reduces bias measures on targeted benchmarks.
2. **Memorization as a Poisson point process**: A sequence's location in the training dataset does not significantly influence how likely it is to be memorized, and a Poisson model fits the observed memorization data well.
3. **Emergence of pretraining term frequency effects**: A marked phase change occurs after 65,000 training steps, beyond which models with 2.8 billion parameters or more exhibit a correlation between task accuracy and the frequency of task-relevant terms in the pretraining corpus.

The suite is designed around three key properties: public access, consistent training provenance, and consistency across scale. All models are trained on the Pile, and two copies of the suite are provided, one trained on the original Pile and one on a deduplicated version. The release includes detailed documentation of the model architectures, training procedures, and evaluation methods. Pythia's dense checkpointing and consistent training setup enable novel insights into the behavior of LLMs, particularly in bias mitigation, memorization dynamics, and the impact of pretraining term frequencies.
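As a rough illustration of how the per-step checkpoints can be used, the sketch below loads one Pythia model at a few intermediate training steps and compares its loss on a fixed prompt. It assumes the checkpoints are hosted on the Hugging Face Hub under names such as `EleutherAI/pythia-70m-deduped`, with intermediate checkpoints exposed as revisions like `step1000`; the prompt and the particular revisions chosen here are illustrative, not part of the paper.

```python
# Minimal sketch: track how a Pythia model's loss on a fixed prompt changes
# across training checkpoints. Assumes Hub names like
# "EleutherAI/pythia-70m-deduped" and per-step revisions like "step1000".
import torch
from transformers import GPTNeoXForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-70m-deduped"              # smallest model in the suite
REVISIONS = ["step1000", "step64000", "step143000"]  # early, mid, and final checkpoints (illustrative)

prompt = "The capital of France is"

for rev in REVISIONS:
    tokenizer = AutoTokenizer.from_pretrained(MODEL, revision=rev)
    model = GPTNeoXForCausalLM.from_pretrained(MODEL, revision=rev)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Passing the input ids as labels yields the average next-token loss.
        out = model(**inputs, labels=inputs["input_ids"])
    print(f"{rev}: loss = {out.loss.item():.3f}")
```

Because every model in the suite saw the training data in the same order, the same loop can be repeated across model sizes to separate the effect of scale from the effect of training duration.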