This paper investigates the scaling laws of language models, focusing on the relationship between model size and the total bits of knowledge stored. The authors propose a principled framework to examine how model size affects knowledge storage capacity. They define a "piece of knowledge" as a (name, attribute, value) tuple and estimate the number of knowledge bits a model can store. Through multiple controlled datasets, they find that language models can store 2 bits of knowledge per parameter, even when quantized to int8. This suggests that a 7B model can store 14B bits of knowledge, surpassing the English Wikipedia and textbooks combined.
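The 2-bits-per-parameter figure supports a simple back-of-the-envelope capacity estimate; a minimal sketch (the function name and structure are illustrative, not from the paper):

```python
def capacity_bits(num_params: float, bits_per_param: float = 2.0) -> float:
    """Back-of-the-envelope knowledge capacity: parameter count x bits per parameter."""
    return num_params * bits_per_param

# A 7B-parameter model at 2 bits/param stores about 1.4e10 bits.
print(f"{capacity_bits(7e9):.2e}")  # prints "1.40e+10"
```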
The paper presents 12 results on how various factors affect a model's knowledge storage capacity, including training duration, model architecture, quantization, sparsity constraints such as MoE, and the data signal-to-noise ratio. Key findings include that the GPT-2 architecture, with rotary embedding, matches or surpasses the LLaMA/Mistral architectures in knowledge storage, particularly over shorter training durations. Prepending domain names (e.g., wikipedia.org) to the training data significantly increases a model's knowledge capacity, because language models can autonomously identify and prioritize knowledge-rich domains.
The paper also explores how quantization, sparsity (MoE), and junk data affect model capacity. Quantizing to int8 does not compromise capacity, while quantizing to int4 reduces it to 0.7 bits per parameter. MoE models, even with 32 experts, reduce capacity by only 1.3x relative to the base scaling law. Junk data significantly reduces model capacity, but adding a special token to the useful knowledge can mitigate this degradation.
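These reported factors can be folded into the same kind of capacity estimate; a hedged sketch using the numbers quoted above (the helper and its argument names are illustrative, not from the paper):

```python
# Reported capacity ratios from the summary: int8 preserves 2 bits/param,
# int4 drops to 0.7 bits/param, and a 32-expert MoE loses ~1.3x capacity.
BITS_PER_PARAM = {"fp16": 2.0, "int8": 2.0, "int4": 0.7}
MOE_REDUCTION = 1.3

def estimate_capacity_bits(num_params: float, quant: str = "int8",
                           moe: bool = False) -> float:
    """Apply the reported quantization and MoE factors to a raw parameter count."""
    bits = num_params * BITS_PER_PARAM[quant]
    if moe:
        bits /= MOE_REDUCTION
    return bits

print(estimate_capacity_bits(7e9, quant="int4"))  # roughly 4.9e9 bits at 0.7 bits/param
```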
The authors conclude that the 2 bits per parameter capacity ratio is a universal law among most typical language model architectures. They also highlight that the capacity ratio is influenced by various hyperparameters, including training duration, model architecture, and data quality. The paper provides a more accurate and principled playground for comparing model architectures, training techniques, and data quality, which can assist practitioners in making informed decisions about model selection and training data preparation.