A Convergence Theory for Deep Learning via Over-Parameterization


November 9, 2018 | Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song
This paper develops a theoretical framework explaining why stochastic gradient descent (SGD) can find global minima in polynomial time when training deep neural networks (DNNs) under certain conditions. The key assumptions are that the training inputs are non-degenerate and that the network is over-parameterized, meaning the number of neurons per layer is polynomial in the number of layers and the number of training samples. The main technique is to show that, in a neighborhood of the random initialization, the optimization landscape is almost convex and semi-smooth, even with ReLU activations; this also establishes an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite-width setting. Under these conditions, SGD achieves 100% training accuracy in classification tasks, or minimizes the regression loss at a linear convergence rate, with running time polynomial in the number of samples and layers. The theory covers fully-connected, convolutional, and residual architectures and applies to smooth loss functions. The paper also discusses implications for generalization and provides a stability theory against small adversarial perturbations.
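To make the over-parameterization setting concrete, here is a minimal sketch (not the paper's construction or its prescribed widths): single-sample SGD on a heavily over-parameterized two-layer ReLU network with NTK-style random initialization, fit to a small synthetic regression set with unit-norm inputs. The width, learning rate, and data sizes are illustrative assumptions; in the paper the required width is polynomial in the number of samples and layers.

```python
# Illustrative sketch only: width m, learning rate, and dataset are toy
# choices, not values prescribed by the theory in the paper.
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 20, 10, 5000          # samples, input dimension, hidden width (m >> n)
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # non-degenerate, unit-norm inputs
y = rng.standard_normal(n)

# Random (NTK-style) initialization; only the hidden layer W is trained here.
W = rng.standard_normal((m, d)) * np.sqrt(2.0 / m)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

def forward(W, x):
    h = np.maximum(W @ x, 0.0)   # ReLU activations
    return a @ h

lr = 0.5
for step in range(2001):
    i = rng.integers(n)          # single-sample SGD
    x, target = X[i], y[i]
    h = np.maximum(W @ x, 0.0)
    err = a @ h - target
    # Gradient of 0.5 * err^2 w.r.t. W (ReLU is piecewise linear).
    grad_W = np.outer(err * a * (h > 0), x)
    W -= lr * grad_W
    if step % 400 == 0:
        loss = 0.5 * np.mean([(forward(W, X[j]) - y[j]) ** 2 for j in range(n)])
        print(f"step {step:4d}  training loss {loss:.6f}")
```

With a width this large relative to the sample count, the printed training loss should shrink roughly geometrically while the weights stay close to their initialization, which is the qualitative behavior (almost-convex landscape near random initialization) that the paper analyzes rigorously.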