4 Jun 2024 | Hugo Cui, Luca Pesce, Yatin Dandi, Florent Krzakala, Yue M. Lu, Lenka Zdeborová, Bruno Loureiro
This paper investigates how two-layer neural networks learn features from data and improve over the kernel regime after a single gradient descent step. The authors model the trained network using a spiked Random Features (sRF) model, which captures the learning behavior of the original network. They provide an exact asymptotic description of the generalization error of the sRF in the high-dimensional limit, where the number of samples, width, and input dimension grow proportionally. This characterization closely matches the learning curves of the original network, and it explains how feature learning allows the network to efficiently fit non-linear functions of the input along the direction singled out by the gradient step, whereas at initialization the network can only express linear functions in this regime.
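To make the setup concrete, here is a minimal numerical sketch (not the authors' code) of the spiked random-features idea: a random first-layer matrix W0 is perturbed by a rank-one spike aligned with a teacher direction, and a ridge-regression readout is fit on the resulting features, once with and once without the spike. The single-index tanh target, the spike form theta * a0 * beta_star^T / sqrt(p), the spike strength theta, and the ridge penalty are all illustrative assumptions rather than the paper's exact scalings.

```python
# Minimal sketch: ridge regression on plain random features sigma(W0 x) versus
# spiked random features sigma((W0 + spike) x), where the rank-one spike points
# along the teacher direction. All scalings below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, p, n, n_test = 256, 512, 2048, 4096
sigma = np.tanh                                    # first-layer activation

beta_star = rng.standard_normal(d) / np.sqrt(d)    # teacher direction (norm ~ 1)

def target(x):
    """Non-linear single-index teacher (illustrative choice)."""
    return np.tanh(x @ beta_star)

X, X_test = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
y, y_test = target(X), target(X_test)

W0 = rng.standard_normal((p, d)) / np.sqrt(d)      # random first layer at initialization
a0 = rng.choice([-1.0, 1.0], size=p)               # fixed second-layer signs
theta = 2.0                                        # spike strength (assumed)
spike = theta * np.outer(a0, beta_star) / np.sqrt(p)
W_spiked = W0 + spike                              # "first layer after one gradient step"

def ridge_test_error(W, lam=1e-2):
    """Fit the readout by ridge regression on features sigma(X W^T); return test MSE."""
    Z, Z_test = sigma(X @ W.T), sigma(X_test @ W.T)
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)
    return np.mean((Z_test @ w - y_test) ** 2)

print("plain RF  test MSE:", ridge_test_error(W0))
print("spiked RF test MSE:", ridge_test_error(W_spiked))
```

The comparison mirrors the qualitative picture in the paper: the only difference between the two models is the rank-one spike in the first layer, which aligns a fraction of the features with the teacher direction.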
The authors show that the learning properties of the sRF model are asymptotically equivalent to those of a conditional Gaussian model in the high-dimensional proportional regime. This equivalence is supported by numerical evidence and extends previous theoretical results on Gaussian equivalence for random features. They also demonstrate that feature learning yields a significant, quantifiable improvement in generalization over plain random features (the lazy regime), especially in data-limited regimes.
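The conditional Gaussian equivalence can be stated schematically as follows; the notation (spike direction u, projection s, conditional mean mu(s) and covariance Sigma(s)) is illustrative and is not the paper's exact theorem statement. The claim is that, for the purpose of computing the asymptotic test error, the sRF features may be replaced by Gaussian features with matching first and second moments conditionally on the low-dimensional projection of the input onto the spike.

```latex
% Schematic form of the conditional Gaussian equivalence (illustrative notation).
% z = sigma(W x): sRF features;  u: spike direction;  s = <u, x>: spike projection.
\[
  z \,\big|\, s \;\overset{d}{\approx}\; \mu(s) + \Sigma(s)^{1/2}\,\xi ,
  \qquad \xi \sim \mathcal{N}(0, I_p),
\]
\[
  \mu(s) = \mathbb{E}\!\left[\sigma(Wx) \,\middle|\, \langle u, x\rangle = s\right],
  \qquad
  \Sigma(s) = \mathrm{Cov}\!\left[\sigma(Wx) \,\middle|\, \langle u, x\rangle = s\right].
\]
```

By contrast, the unconditional Gaussian equivalence used for plain random features matches only global moments; conditioning on s is what lets the equivalent model retain the non-linear dependence on the learned direction.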
The paper provides a sharp asymptotic treatment of a setting where feature learning is modeled in a non-perturbative, high-dimensional regime, with a model able to express non-linear functions beyond polynomials. The authors derive exact asymptotics for the sRF, discuss conditional Gaussian equivalence, and provide bounds on the generalization error. They further show that the test error is characterized by a low-dimensional system of self-consistent equations whose solution captures the behavior of the network in the high-dimensional limit.
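As a rough illustration of what such a characterization looks like in this line of work (the paper's actual order parameters and equations differ in their details), the asymptotic test error reduces to a low-dimensional Gaussian average parameterized by a handful of scalar overlaps, which in turn solve a finite system of self-consistent equations:

```latex
% Generic shape of an exact-asymptotics characterization (illustrative, not the
% paper's exact equations). rho: teacher second moment; m, q: alignment and norm
% overlaps of the trained readout, obtained as the fixed point of a finite set of
% self-consistent (saddle-point) equations, in the proportional limit
% n, p, d -> infinity with n/d and p/d fixed.
\[
  \varepsilon_g \;\longrightarrow\;
  \mathbb{E}_{(\nu,\lambda)}\!\left[\big(f_\star(\nu) - \hat f(\lambda)\big)^2\right],
  \qquad
  (\nu,\lambda) \sim \mathcal{N}\!\left(0,
  \begin{pmatrix} \rho & m \\ m & q \end{pmatrix}\right).
\]
```

Solving such self-consistent equations, typically by fixed-point iteration, yields the theoretical learning curves that are compared against simulations.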
The results are supported by numerical experiments, which show that the theoretical characterization closely matches the learning curves of the original network. The paper concludes that the sRF model provides a valuable analytical tool for understanding the learning behavior of two-layer neural networks after a single gradient step. The findings highlight the importance of feature learning in improving the performance of neural networks beyond the kernel regime.