4 Jun 2024 | Hugo Cui, Luca Pesce, Yatin Dandi, Florent Krzakala, Yue M. Lu, Lenka Zdeborová, Bruno Loureiro
This paper investigates how two-layer neural networks learn features from data and improve over the kernel regime after a single gradient descent step. The authors model the trained network using a spiked Random Features (sRF) model, which captures the learning behavior of the original network. They provide an exact asymptotic description of the generalization error of the sRF in the high-dimensional limit, where the number of samples, width, and input dimension grow proportionally. This characterization closely matches the learning curves of the original network, and it explains how feature learning allows the network to efficiently fit non-linear functions of the input along the direction singled out by the gradient step, whereas at initialization the network can only express linear functions in this regime.
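To make the setup concrete, here is a minimal numerical sketch (not the authors' code) of the spiked random-features idea: a random first-layer matrix W0 is perturbed by a rank-one spike aligned with a teacher direction, and a ridge-regression readout is fit on the resulting features, once with and once without the spike. The single-index tanh target, the spike form theta * a0 * beta_star^T / sqrt(p), the spike strength theta, and the ridge penalty are all illustrative assumptions rather than the paper's exact scalings.

```python
# Minimal sketch: ridge regression on plain random features sigma(W0 x) versus
# spiked random features sigma((W0 + spike) x), where the rank-one spike points
# along the teacher direction. All scalings below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, p, n, n_test = 256, 512, 2048, 4096
sigma = np.tanh                                    # first-layer activation

beta_star = rng.standard_normal(d) / np.sqrt(d)    # teacher direction (norm ~ 1)

def target(x):
    """Non-linear single-index teacher (illustrative choice)."""
    return np.tanh(x @ beta_star)

X, X_test = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
y, y_test = target(X), target(X_test)

W0 = rng.standard_normal((p, d)) / np.sqrt(d)      # random first layer at initialization
a0 = rng.choice([-1.0, 1.0], size=p)               # fixed second-layer signs
theta = 2.0                                        # spike strength (assumed)
spike = theta * np.outer(a0, beta_star) / np.sqrt(p)
W_spiked = W0 + spike                              # "first layer after one gradient step"

def ridge_test_error(W, lam=1e-2):
    """Fit the readout by ridge regression on features sigma(X W^T); return test MSE."""
    Z, Z_test = sigma(X @ W.T), sigma(X_test @ W.T)
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)
    return np.mean((Z_test @ w - y_test) ** 2)

print("plain RF  test MSE:", ridge_test_error(W0))
print("spiked RF test MSE:", ridge_test_error(W_spiked))
```

The comparison mirrors the qualitative picture in the paper: the only difference between the two models is the rank-one spike in the first layer, which aligns a fraction of the features with the teacher direction.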
The authors show that the learning properties of the sRF model are asymptotically equivalent to those of a conditional Gaussian model in the high-dimensional proportional regime. This equivalence is supported by numerical evidence and extends previous theoretical results on Gaussian equivalence for random features. They also demonstrate that feature learning yields a significant, quantifiable improvement in generalization over plain random features (the lazy regime), especially in data-limited regimes.
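The conditional Gaussian equivalence can be stated schematically as follows; the notation (spike direction u, projection s, conditional mean mu(s) and covariance Sigma(s)) is illustrative and is not the paper's exact theorem statement. The claim is that, for the purpose of computing the asymptotic test error, the sRF features may be replaced by Gaussian features with matching first and second moments conditionally on the low-dimensional projection of the input onto the spike.

```latex
% Schematic form of the conditional Gaussian equivalence (illustrative notation).
% z = sigma(W x): sRF features;  u: spike direction;  s = <u, x>: spike projection.
\[
  z \,\big|\, s \;\overset{d}{\approx}\; \mu(s) + \Sigma(s)^{1/2}\,\xi ,
  \qquad \xi \sim \mathcal{N}(0, I_p),
\]
\[
  \mu(s) = \mathbb{E}\!\left[\sigma(Wx) \,\middle|\, \langle u, x\rangle = s\right],
  \qquad
  \Sigma(s) = \mathrm{Cov}\!\left[\sigma(Wx) \,\middle|\, \langle u, x\rangle = s\right].
\]
```

By contrast, the unconditional Gaussian equivalence used for plain random features matches only global moments; conditioning on s is what lets the equivalent model retain the non-linear dependence on the learned direction.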
The paper provides a sharp asymptotic treatment of a setting where feature learning is modeled in a non-perturbative, high-dimensional regime, with a model able to express non-linear functions beyond polynomials. The authors derive exact asymptotics for the sRF, discuss conditional Gaussian equivalence, and provide bounds on the generalization error. They further show that the test error is characterized by a low-dimensional system of self-consistent equations whose solution captures the behavior of the network in the high-dimensional limit.
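As a rough illustration of what such a characterization looks like in this line of work (the paper's actual order parameters and equations differ in their details), the asymptotic test error reduces to a low-dimensional Gaussian average parameterized by a handful of scalar overlaps, which in turn solve a finite system of self-consistent equations:

```latex
% Generic shape of an exact-asymptotics characterization (illustrative, not the
% paper's exact equations). rho: teacher second moment; m, q: alignment and norm
% overlaps of the trained readout, obtained as the fixed point of a finite set of
% self-consistent (saddle-point) equations, in the proportional limit
% n, p, d -> infinity with n/d and p/d fixed.
\[
  \varepsilon_g \;\longrightarrow\;
  \mathbb{E}_{(\nu,\lambda)}\!\left[\big(f_\star(\nu) - \hat f(\lambda)\big)^2\right],
  \qquad
  (\nu,\lambda) \sim \mathcal{N}\!\left(0,
  \begin{pmatrix} \rho & m \\ m & q \end{pmatrix}\right).
\]
```

Solving such self-consistent equations, typically by fixed-point iteration, yields the theoretical learning curves that are compared against simulations.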
The results are supported by numerical experiments, which show that the theoretical characterization closely matches the learning curves of the original network. The paper concludes that the sRF model provides a valuable analytical tool for understanding the learning behavior of two-layer neural networks after a single gradient step. The findings highlight the importance of feature learning in improving the performance of neural networks beyond the kernel regime.