A Convergence Theory for Deep Learning via Over-Parameterization


November 9, 2018 | Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song
This paper presents a convergence theory for deep learning via over-parameterization. The authors prove that stochastic gradient descent (SGD) finds global minima of the training objective of deep neural networks (DNNs) in polynomial time under two assumptions: non-degenerate inputs and over-parameterization, meaning the network width is sufficiently large (polynomial in the number of layers and the number of samples). The key technique is to show that the optimization landscape is almost convex and semi-smooth in a neighborhood of the random initialization, even with ReLU activations; this also yields an equivalence between over-parameterized neural networks and neural tangent kernels (NTKs) in the finite-width setting.

Concretely, SGD attains 100% training accuracy in classification tasks, or drives the regression loss below any target error at a linear convergence rate, with running time polynomial in the number of samples and layers. The theory covers the widely used but non-smooth ReLU activation together with any smooth, possibly non-convex loss function, and it applies to fully-connected networks, convolutional neural networks (CNNs), and residual networks (ResNets). The authors also show that ReLU networks at random initialization do not suffer from exponential gradient explosion or vanishing, which is key to avoiding exponential dependence on the number of layers, and they derive a stability theory of neural networks against small but adversarial perturbations of the weights. The paper further discusses the implications of these results for generalization and compares them with related work.
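To make the "linear convergence" claim concrete: the training objective F decays geometrically in the iteration count. Schematically, and suppressing the exact polynomial factors that the paper's theorems spell out, one step of (stochastic) gradient descent with step size η contracts the loss roughly as

\[ F\big(W^{(t)}\big) \;\le\; \Big(1 - \Theta\Big(\frac{\eta\,\delta\,m}{\mathrm{poly}(n,L)\,d}\Big)\Big)^{t}\, F\big(W^{(0)}\big), \]

where m is the network width, n the number of samples, L the depth, d the input dimension, and δ the minimum pairwise separation of the (non-degenerate) inputs. This is a schematic form only; the precise constants and exponents are those stated in the paper.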
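As a rough, self-contained illustration of the regime the theory describes (not the paper's exact construction, initialization, or hyper-parameters, which are all chosen here for a toy problem), the following NumPy sketch trains an over-parameterized two-layer ReLU network with SGD on a small regression task; with width m much larger than the sample count n, the training loss typically decays geometrically, in line with the linear-convergence claim.

import numpy as np

rng = np.random.default_rng(0)

n, d, m = 20, 10, 4096                          # samples, input dim, hidden width (m >> n)
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm, well-separated ("non-degenerate") inputs
y = rng.standard_normal(n)

W = rng.standard_normal((m, d)) * np.sqrt(2.0 / m)   # ReLU-friendly N(0, 2/m) initialization
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)     # fixed output layer; only W is trained

def forward(W, x):
    h = np.maximum(W @ x, 0.0)        # ReLU hidden layer
    return a @ h, h

eta = 0.5                             # step size tuned for this toy setup
for step in range(2000):
    i = rng.integers(n)               # SGD: pick a random training point
    pred, h = forward(W, X[i])
    err = pred - y[i]
    # Gradient of 0.5 * err^2 w.r.t. W (ReLU sub-gradient at the active units).
    grad = np.outer(err * a * (h > 0), X[i])
    W -= eta * grad
    if step % 400 == 0:
        loss = 0.5 * np.mean([(forward(W, x)[0] - t) ** 2 for x, t in zip(X, y)])
        print(f"step {step:5d}  training loss {loss:.3e}")

The sketch keeps the output layer fixed and trains only the hidden weights, a common simplification in this line of analysis; widening m (or shrinking it well below n) is an easy way to see how over-parameterization affects the fit.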