Critical Data Size of Language Models from a Grokking Perspective

The paper explores the critical data size in language models, a threshold that marks the transition from memorization to generalization. The authors formalize this phase transition as the Data Efficiency Hypothesis and identify three regimes in training dynamics: data insufficiency, sufficiency, and surplus. They develop a grokking configuration that reproduces grokking on simplistic language models by rescaling initialization and weight decay. The study shows that generalization occurs only once the training data reaches a critical size, and that this critical size increases with model size. Experiments on modular addition, IMDB, Yelp, and Natural-Instructions support the hypothesis. The findings deepen the understanding of language model training by highlighting the role of data in the learning mechanism.
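As a rough illustration of the kind of experiment the summary describes, the sketch below builds a modular addition dataset and sweeps the fraction of it used for training, which is one simple way to probe for a critical data size. It also shows what "rescaling initialization" could mean in its most basic form: multiplying the initial weights by a constant. The modulus `P = 97`, the training fractions, the factor `alpha`, and all helper names are assumptions made for this example, not values or code from the paper.

```python
import random


def modular_addition_pairs(p):
    """Return every (a, b, (a + b) % p) triple for a, b in [0, p)."""
    return [(a, b, (a + b) % p) for a in range(p) for b in range(p)]


def split_by_fraction(examples, train_fraction, seed=0):
    """Shuffle deterministically, then keep `train_fraction` for training."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def rescale_weights(weights, alpha):
    """Multiply every initial weight by `alpha` (a hypothetical rescaling factor)."""
    return [[alpha * w for w in row] for row in weights]


P = 97  # hypothetical modulus, chosen only for illustration
data = modular_addition_pairs(P)

# Sweep the training-set size; in a real study each split would train a model
# and record train/test accuracy to locate where generalization begins.
for frac in (0.1, 0.3, 0.5, 0.7):
    train, test = split_by_fraction(data, frac)
    print(f"train_fraction={frac:.1f} train={len(train)} test={len(test)}")
```

In an actual run, each split would be paired with a model trained under some weight decay setting, and the smallest fraction at which test accuracy rises would be taken as an estimate of the critical data size for that model.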

23 May 2024 | Xuekai Zhu, Yao Fu, Bowen Zhou, Zhouhan Lin