Gradient Descent Finds Global Minima of Deep Neural Networks

28 May 2019 | Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, Xiyu Zhai
This paper shows that gradient descent can find a global minimum of the training loss for deep neural networks, even though the objective is non-convex. The authors prove that, for deep over-parameterized networks with residual connections (ResNet), randomly initialized gradient descent drives the training loss to zero in time polynomial in the problem parameters. The analysis rests on the structure of a Gram matrix induced by the network architecture: this matrix remains stable throughout training, and its stability implies the global optimality of gradient descent. The same argument extends to deep residual convolutional networks and yields analogous convergence guarantees.

The paper addresses two puzzles in deep learning: why randomly initialized first-order methods such as gradient descent reach zero training loss despite non-convexity, and why very deep networks are hard to train. The first is explained by over-parameterization, which gives the network enough capacity to fit all the training data. The second is addressed by the ResNet architecture, whose residual connections make networks with many layers trainable; in the theory this shows up as a much milder width requirement at large depth.

The setting has n data points and H layers of width m, with the least-squares loss and an activation function that is Lipschitz and smooth. The main results are: for fully-connected networks, if m is sufficiently large (polynomial in n but exponential in the depth H), gradient descent converges to zero training loss at a linear rate; for ResNet, a width that is only polynomial in both n and H suffices for the same linear-rate convergence, a markedly better dependence on depth; and the same techniques carry over to convolutional ResNet. A small numerical sketch of the convergence claim follows.
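As an illustration of the linear-rate convergence claim, here is a minimal sketch in NumPy. It uses a wide two-layer network with a smooth activation (tanh), full-batch gradient descent, and randomly generated data; the architecture, width, step size, and data sizes are illustrative choices for this note, not the paper's deep or residual setting. The training loss should drop geometrically toward numerical zero while the weights barely move from their random initialization, which is the over-parameterized behavior the theorems formalize.

```python
# Minimal NumPy sketch (illustrative, not the paper's exact setting): full-batch
# gradient descent on the least-squares loss for a wide two-layer network with a
# smooth activation. The loss should fall geometrically to ~0 while the weights
# stay close to their initialization.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10, 20, 4096          # n data points, input dimension d, width m (over-parameterized)
eta = 1.0                       # step size chosen empirically for this toy problem

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs, as in the paper's setup
y = rng.standard_normal(n)

W0 = rng.standard_normal((m, d))                # hidden-layer weights at initialization
a = rng.choice([-1.0, 1.0], size=m)             # output layer fixed at +-1 (not trained here)
W = W0.copy()

def predict(W):
    # u_i = (1/sqrt(m)) * sum_r a_r * tanh(w_r . x_i); tanh is one smooth, Lipschitz activation
    return np.tanh(X @ W.T) @ a / np.sqrt(m)

for k in range(201):
    u = predict(W)
    loss = 0.5 * np.sum((u - y) ** 2)
    if k % 40 == 0:
        drift = np.linalg.norm(W - W0) / np.linalg.norm(W0)
        print(f"iter {k:3d}   loss {loss:.3e}   relative weight drift {drift:.3e}")
    # gradient of the least-squares loss with respect to W
    S = (u - y)[:, None] * (1.0 - np.tanh(X @ W.T) ** 2) * a[None, :]   # (n, m)
    W -= eta * (S.T @ X) / np.sqrt(m)
```

The output layer is held fixed only to keep the sketch short; training it as well does not change the qualitative picture.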
The analysis builds on the observation from prior work that over-parameterized networks keep their weights close to initialization throughout training. The dynamics of the predictions are governed by the least eigenvalue of the Gram matrix induced by the architecture, and the authors show that bounding the distance of each weight matrix from its initialization is enough to lower bound that least eigenvalue, and hence the convergence rate.

Relative to earlier results, the paper covers deep networks, ResNet architectures, and convolutional networks rather than shallow models, and the use of smooth activations improves the dependence of the required width on the sample size n. The resulting guarantees need less over-parameterization than previous analyses, and compared with concurrent work the approach requires both smaller width and lower iteration complexity. The paper concludes that, under these conditions, gradient descent finds a global minimum of the training loss for deep neural networks.
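To make the role of the Gram matrix concrete, the sketch below computes the empirical Gram matrix H of prediction gradients, H_ij = <grad_theta u_i, grad_theta u_j>, at initialization for a small two-layer toy model (again an illustrative stand-in, not the paper's deep or ResNet setting). In the linearized, heavily over-parameterized regime the residual evolves approximately as u(k+1) - y ~ (I - eta*H)(u(k) - y), so (1 - eta*lambda_min(H))^2 bounds the per-step contraction of the squared loss; the script compares that bound with the contraction observed after one real gradient step.

```python
# Sketch of the quantity driving the analysis, on an assumed two-layer toy model
# (not the paper's deep or residual architectures): the Gram matrix
#     H_ij = < grad_theta u_i , grad_theta u_j >
# of prediction gradients. In the linearized regime the residual obeys
#     u(k+1) - y  ~  (I - eta * H) (u(k) - y),
# so lambda_min(H) lower-bounds how fast every component of the error shrinks.
import numpy as np

rng = np.random.default_rng(1)
n, d, m, eta = 10, 20, 2048, 0.5

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.standard_normal(n)
W = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)

def predict(W):
    return np.tanh(X @ W.T) @ a / np.sqrt(m)

def jacobian(W):
    # row i = gradient of prediction u_i with respect to the entries of W, flattened
    S = (1.0 - np.tanh(X @ W.T) ** 2) * a[None, :] / np.sqrt(m)    # (n, m)
    return (S[:, :, None] * X[:, None, :]).reshape(n, -1)          # (n, m*d)

J = jacobian(W)
H = J @ J.T                                    # empirical Gram matrix, n x n
lam_min = np.linalg.eigvalsh(H).min()
print("lambda_min(H) at initialization:", lam_min)
print("slowest-mode bound on per-step loss contraction:", (1.0 - eta * lam_min) ** 2)

# one actual gradient-descent step; the observed contraction should not exceed the bound
u0 = predict(W)
grad_W = (J.T @ (u0 - y)).reshape(m, d)
u1 = predict(W - eta * grad_W)
print("observed contraction after one step:", np.sum((u1 - y) ** 2) / np.sum((u0 - y) ** 2))
```

In the paper this picture is maintained across all iterations: because the weights stay near initialization, the Gram matrix, and hence its least eigenvalue, stays close to its initial value, which yields the linear convergence rate.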