This paper investigates the scaling laws of language models, focusing on the relationship between model size and the total bits of knowledge stored. The authors propose a principled framework to examine how model size affects knowledge storage capacity. They define a "piece of knowledge" as a (name, attribute, value) tuple and estimate the number of knowledge bits a model can store. Through multiple controlled datasets, they find that language models can store 2 bits of knowledge per parameter, even when quantized to int8. This suggests that a 7B model can store 14B bits of knowledge, surpassing the English Wikipedia and textbooks combined.
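The 2-bits-per-parameter figure supports a simple back-of-the-envelope capacity estimate; a minimal sketch (the function name and structure are illustrative, not from the paper):

```python
def capacity_bits(num_params: float, bits_per_param: float = 2.0) -> float:
    """Back-of-the-envelope knowledge capacity: parameter count x bits per parameter."""
    return num_params * bits_per_param

# A 7B-parameter model at 2 bits/param stores about 1.4e10 bits.
print(f"{capacity_bits(7e9):.2e}")  # prints "1.40e+10"
```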
The paper presents 12 results on how various factors affect a model's knowledge storage capacity, including training duration, model architecture, quantization, sparsity constraints such as MoE, and the data signal-to-noise ratio. Key findings include that the GPT-2 architecture, with rotary embedding, matches or surpasses the LLaMA/Mistral architectures in knowledge storage, particularly over shorter training durations. Prepending domain names (e.g., wikipedia.org) to the training data significantly increases a model's knowledge capacity, because language models can autonomously identify and prioritize knowledge-rich domains.
The paper also explores how quantization, sparsity (MoE), and junk data affect model capacity. Quantizing to int8 does not compromise capacity, while quantizing to int4 reduces it to 0.7 bits per parameter. MoE models, even with 32 experts, reduce capacity by only 1.3x relative to the base scaling law. Junk data significantly reduces model capacity, but adding a special token to the useful knowledge can mitigate this degradation.
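These reported factors can be folded into the same kind of capacity estimate; a hedged sketch using the numbers quoted above (the helper and its argument names are illustrative, not from the paper):

```python
# Reported capacity ratios from the summary: int8 preserves 2 bits/param,
# int4 drops to 0.7 bits/param, and a 32-expert MoE loses ~1.3x capacity.
BITS_PER_PARAM = {"fp16": 2.0, "int8": 2.0, "int4": 0.7}
MOE_REDUCTION = 1.3

def estimate_capacity_bits(num_params: float, quant: str = "int8",
                           moe: bool = False) -> float:
    """Apply the reported quantization and MoE factors to a raw parameter count."""
    bits = num_params * BITS_PER_PARAM[quant]
    if moe:
        bits /= MOE_REDUCTION
    return bits

print(estimate_capacity_bits(7e9, quant="int4"))  # roughly 4.9e9 bits at 0.7 bits/param
```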
The authors conclude that the 2 bits per parameter capacity ratio is a universal law among most typical language model architectures. They also highlight that the capacity ratio is influenced by various hyperparameters, including training duration, model architecture, and data quality. The paper provides a more accurate and principled playground for comparing model architectures, training techniques, and data quality, which can assist practitioners in making informed decisions about model selection and training data preparation.