Gradient Descent Finds Global Minima of Deep Neural Networks

28 May 2019 | Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, Xiyu Zhai
This paper shows that gradient descent can find a global minimum of the training loss for deep neural networks, even though the objective is non-convex. The authors prove that, for deep over-parameterized networks with residual connections (ResNet), randomly initialized gradient descent drives the training loss to zero in time polynomial in the problem parameters. The analysis rests on the structure of a Gram matrix induced by the network architecture: this matrix remains stable throughout training, and its stability implies the global optimality of gradient descent. The same argument extends to deep residual convolutional networks and yields analogous convergence guarantees.

The paper addresses two puzzles in deep learning: why randomly initialized first-order methods such as gradient descent reach zero training loss despite non-convexity, and why very deep networks are hard to train. The first is explained by over-parameterization, which gives the network enough capacity to fit all the training data. The second is addressed by the ResNet architecture, whose residual connections make networks with many layers trainable; in the theory this shows up as a much milder width requirement at large depth.

The setting has n data points and H layers of width m, with the least-squares loss and an activation function that is Lipschitz and smooth. The main results are: for fully-connected networks, if m is sufficiently large (polynomial in n but exponential in the depth H), gradient descent converges to zero training loss at a linear rate; for ResNet, a width that is only polynomial in both n and H suffices for the same linear-rate convergence, a markedly better dependence on depth; and the same techniques carry over to convolutional ResNet. A small numerical sketch of the convergence claim follows.
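As an illustration of the linear-rate convergence claim, here is a minimal sketch in NumPy. It uses a wide two-layer network with a smooth activation (tanh), full-batch gradient descent, and randomly generated data; the architecture, width, step size, and data sizes are illustrative choices for this note, not the paper's deep or residual setting. The training loss should drop geometrically toward numerical zero while the weights barely move from their random initialization, which is the over-parameterized behavior the theorems formalize.

```python
# Minimal NumPy sketch (illustrative, not the paper's exact setting): full-batch
# gradient descent on the least-squares loss for a wide two-layer network with a
# smooth activation. The loss should fall geometrically to ~0 while the weights
# stay close to their initialization.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10, 20, 4096          # n data points, input dimension d, width m (over-parameterized)
eta = 1.0                       # step size chosen empirically for this toy problem

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs, as in the paper's setup
y = rng.standard_normal(n)

W0 = rng.standard_normal((m, d))                # hidden-layer weights at initialization
a = rng.choice([-1.0, 1.0], size=m)             # output layer fixed at +-1 (not trained here)
W = W0.copy()

def predict(W):
    # u_i = (1/sqrt(m)) * sum_r a_r * tanh(w_r . x_i); tanh is one smooth, Lipschitz activation
    return np.tanh(X @ W.T) @ a / np.sqrt(m)

for k in range(201):
    u = predict(W)
    loss = 0.5 * np.sum((u - y) ** 2)
    if k % 40 == 0:
        drift = np.linalg.norm(W - W0) / np.linalg.norm(W0)
        print(f"iter {k:3d}   loss {loss:.3e}   relative weight drift {drift:.3e}")
    # gradient of the least-squares loss with respect to W
    S = (u - y)[:, None] * (1.0 - np.tanh(X @ W.T) ** 2) * a[None, :]   # (n, m)
    W -= eta * (S.T @ X) / np.sqrt(m)
```

The output layer is held fixed only to keep the sketch short; training it as well does not change the qualitative picture.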
The analysis builds on the observation from prior work that over-parameterized networks keep their weights close to initialization throughout training. The dynamics of the predictions are governed by the least eigenvalue of the Gram matrix induced by the architecture, and the authors show that bounding the distance of each weight matrix from its initialization is enough to lower bound that least eigenvalue, and hence the convergence rate.

Relative to earlier results, the paper covers deep networks, ResNet architectures, and convolutional networks rather than shallow models, and the use of smooth activations improves the dependence of the required width on the sample size n. The resulting guarantees need less over-parameterization than previous analyses, and compared with concurrent work the approach requires both smaller width and lower iteration complexity. The paper concludes that, under these conditions, gradient descent finds a global minimum of the training loss for deep neural networks.
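To make the role of the Gram matrix concrete, the sketch below computes the empirical Gram matrix H of prediction gradients, H_ij = <grad_theta u_i, grad_theta u_j>, at initialization for a small two-layer toy model (again an illustrative stand-in, not the paper's deep or ResNet setting). In the linearized, heavily over-parameterized regime the residual evolves approximately as u(k+1) - y ~ (I - eta*H)(u(k) - y), so (1 - eta*lambda_min(H))^2 bounds the per-step contraction of the squared loss; the script compares that bound with the contraction observed after one real gradient step.

```python
# Sketch of the quantity driving the analysis, on an assumed two-layer toy model
# (not the paper's deep or residual architectures): the Gram matrix
#     H_ij = < grad_theta u_i , grad_theta u_j >
# of prediction gradients. In the linearized regime the residual obeys
#     u(k+1) - y  ~  (I - eta * H) (u(k) - y),
# so lambda_min(H) lower-bounds how fast every component of the error shrinks.
import numpy as np

rng = np.random.default_rng(1)
n, d, m, eta = 10, 20, 2048, 0.5

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.standard_normal(n)
W = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)

def predict(W):
    return np.tanh(X @ W.T) @ a / np.sqrt(m)

def jacobian(W):
    # row i = gradient of prediction u_i with respect to the entries of W, flattened
    S = (1.0 - np.tanh(X @ W.T) ** 2) * a[None, :] / np.sqrt(m)    # (n, m)
    return (S[:, :, None] * X[:, None, :]).reshape(n, -1)          # (n, m*d)

J = jacobian(W)
H = J @ J.T                                    # empirical Gram matrix, n x n
lam_min = np.linalg.eigvalsh(H).min()
print("lambda_min(H) at initialization:", lam_min)
print("slowest-mode bound on per-step loss contraction:", (1.0 - eta * lam_min) ** 2)

# one actual gradient-descent step; the observed contraction should not exceed the bound
u0 = predict(W)
grad_W = (J.T @ (u0 - y)).reshape(m, d)
u1 = predict(W - eta * grad_W)
print("observed contraction after one step:", np.sum((u1 - y) ** 2) / np.sum((u0 - y) ** 2))
```

In the paper this picture is maintained across all iterations: because the weights stay near initialization, the Gram matrix, and hence its least eigenvalue, stays close to its initial value, which yields the linear convergence rate.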