10 Jun 2024 | Daniel Kunin*¹, Allan Raventós*¹, Clémentine Domine², Feng Chen¹, David Klindt³, Andrew Saxe², Surya Ganguli¹
This paper explores the mechanisms behind rapid feature learning in modern neural networks, focusing on the transition between the lazy and rich learning regimes. The authors derive exact solutions for a minimal model that transitions between these regimes, revealing how unbalanced layer-specific initialization variances and learning rates influence the degree of feature learning. They find that balanced initializations, where all layers learn at similar speeds, lead to rich learning in linear networks, while unbalanced initializations can accelerate rich learning in nonlinear networks. Through experiments, they demonstrate that upstream initializations (faster learning in earlier layers) drive feature learning in deep finite-width networks, enhance interpretability in early layers of CNNs, reduce sample complexity for hierarchical data, and decrease the time to grokking in modular arithmetic. The study provides insights into the inductive biases of both regimes and the transition between them, highlighting the importance of unbalanced initializations in achieving efficient feature learning.
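To make the layer-specific knobs concrete, here is a minimal sketch (not the authors' code) of the kind of two-layer linear model the summary describes: f(x) = aᵀWx trained with separate per-layer learning rates, so that either the first (upstream) or second (downstream) layer learns faster. The task setup, the parameter names (`lr_w`, `lr_a`, `sigma_w`, `sigma_a`), and the richness proxy (how far the first-layer weights move from initialization) are illustrative assumptions, not the paper's exact experiments or metrics.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 256, 20, 100                      # samples, input dim, hidden width

# Toy noiseless regression task: the target depends on a single input direction.
beta = np.zeros(d)
beta[0] = 1.0
X = rng.normal(size=(n, d))
y = X @ beta

def train(lr_w, lr_a, steps=3000, sigma_w=1.0, sigma_a=1.0):
    """Train f(x) = a^T W x with layer-specific learning rates and init scales.

    Returns (relative movement of the first-layer weights, final half-MSE),
    where the movement serves as a crude proxy for how much feature learning
    occurred in the first layer.
    """
    W = sigma_w * rng.normal(size=(h, d)) / np.sqrt(d)   # first (upstream) layer
    a = sigma_a * rng.normal(size=h) / np.sqrt(h)        # second (downstream) layer
    W0 = W.copy()
    for _ in range(steps):
        hidden = X @ W.T                     # (n, h) hidden representation
        err = hidden @ a - y                 # residuals of the network output
        grad_a = hidden.T @ err / n          # gradient for the second layer
        grad_W = np.outer(a, err @ X) / n    # gradient for the first layer
        a -= lr_a * grad_a
        W -= lr_w * grad_W
    loss = 0.5 * np.mean(((X @ W.T) @ a - y) ** 2)
    movement = np.linalg.norm(W - W0) / np.linalg.norm(W0)
    return movement, loss

# Compare a configuration where the earlier layer learns faster ("upstream")
# against one where the later layer learns faster ("downstream").
for name, lr_w, lr_a in [("upstream-fast ", 0.05, 0.0005),
                         ("downstream-fast", 0.0005, 0.05)]:
    movement, loss = train(lr_w=lr_w, lr_a=lr_a)
    print(f"{name}: first-layer movement = {movement:.3f}, final loss = {loss:.2e}")
```

In this toy setup the per-layer learning-rate ratio plays the same role that layer-specific initialization variances play in the summary above: both control which layer's weights move most during training, and hence how much the learned first-layer representation departs from its random initialization.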