10 Feb 2020 | Arthur Jacot, Franck Gabriel, Clément Hongler
This paper explores the connection between artificial neural networks (ANNs) and kernel methods, focusing on the Neural Tangent Kernel (NTK). The authors prove that during training, an ANN's function $f_{\theta}$ follows a kernel gradient descent path with respect to the NTK, a kernel that depends on the network's depth, nonlinearity, and initialization variance. In the infinite-width limit, the NTK converges to a deterministic kernel and remains constant during training, making it possible to study training in function space. The positive-definiteness of this limiting NTK, which holds when the data is supported on a sphere and the nonlinearity is non-polynomial, is linked to the convergence and generalization properties of ANNs. For least-squares regression, the network function $f_{\theta}$ follows a linear differential equation, and convergence is fastest along the largest kernel principal components of the input data, providing a theoretical motivation for early stopping. Numerical experiments validate these findings, showing that wide ANNs behave similarly to the theoretical limit.
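To make the two central objects concrete, here is a minimal sketch (not the authors' code) of the finite-width *empirical* NTK, $\Theta(x, x') = \langle \nabla_{\theta} f_{\theta}(x), \nabla_{\theta} f_{\theta}(x') \rangle$, together with the closed-form gradient-flow solution of the least-squares dynamics $\partial_t f_t = -\Theta (f_t - y)$ on the training set. The helper names (`init_params`, `empirical_ntk`, `training_dynamics`) and the simplified NTK parameterization (no bias scaling) are assumptions for illustration; the paper's theorems concern the infinite-width limit, where this empirical kernel becomes deterministic and constant.

```python
import jax
import jax.numpy as jnp

def init_params(key, sizes):
    # Standard Gaussian weights; NTK parameterization applies the 1/sqrt(width)
    # scaling in the forward pass rather than at initialization.
    params = []
    for (n_in, n_out), k in zip(zip(sizes[:-1], sizes[1:]),
                                jax.random.split(key, len(sizes) - 1)):
        kw, kb = jax.random.split(k)
        params.append((jax.random.normal(kw, (n_out, n_in)),
                       jax.random.normal(kb, (n_out,))))
    return params

def f(params, x):
    # Scalar-output MLP, pre-activations scaled by 1/sqrt(fan_in).
    h = x
    for i, (W, b) in enumerate(params):
        z = W @ h / jnp.sqrt(h.shape[0]) + b
        h = jnp.tanh(z) if i < len(params) - 1 else z
    return h[0]

def empirical_ntk(params, X):
    # Theta[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>
    grads = jax.vmap(lambda x: jax.grad(f)(params, x))(X)
    flat = jnp.concatenate(
        [g.reshape(X.shape[0], -1) for g in jax.tree_util.tree_leaves(grads)],
        axis=1)
    return flat @ flat.T

def training_dynamics(theta_mat, f0, y, t):
    # Linear ODE solution on the training set: f_t = y + exp(-Theta t)(f_0 - y).
    # Components along the largest eigenvalues (kernel principal components)
    # decay fastest, which is the motivation for early stopping in the paper.
    evals, evecs = jnp.linalg.eigh(theta_mat)
    coef = evecs.T @ (f0 - y)
    return y + evecs @ (jnp.exp(-evals * t) * coef)

# Hypothetical usage: data on the unit sphere, as in the paper's setting.
key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (20, 5))
X = X / jnp.linalg.norm(X, axis=1, keepdims=True)
y = jnp.sin(X[:, 0])
params = init_params(key, [5, 512, 512, 1])
theta = empirical_ntk(params, X)
f0 = jax.vmap(lambda x: f(params, x))(X)
ft = training_dynamics(theta, f0, y, t=100.0)
```

Increasing the hidden width (e.g. 512 to 4096) makes the empirical kernel concentrate around its deterministic limit, which is the regime in which the summarized theorems apply.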