5 Feb 2019 | Simon S. Du*, Xiyu Zhai*, Barnabás Poczos, Aarti Singh
This paper addresses the mystery of why randomly initialized first-order methods such as gradient descent can achieve zero training loss on non-convex and non-smooth optimization problems, specifically for two-layer fully connected ReLU-activated neural networks. The authors show that as long as the number of hidden nodes \( m \) is large enough and no two inputs are parallel, gradient descent on the quadratic loss converges to a globally optimal solution at a linear rate. The key insight is that over-parameterization and random initialization jointly restrict every weight vector to stay close to its initialization, which enables a strong-convexity-like property that yields the linear convergence rate. The analysis relies on the spectral properties of a Gram matrix, which is shown to remain stable throughout training. The paper also provides a quantitative convergence rate and discusses implications for deep neural networks.
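As a rough illustration of the setting analyzed here, the sketch below trains a width-\( m \) two-layer ReLU network on the quadratic loss with full-batch gradient descent, training only the first-layer weights while the output weights are fixed random signs, and reports the minimum eigenvalue of the activation-pattern Gram matrix at initialization and after training. The data, width, step size, and scaling are illustrative assumptions, not the authors' exact construction.

```python
import numpy as np

# Illustrative sketch: the data, width m, step size, and number of iterations
# below are assumptions for demonstration, not the paper's exact construction.
rng = np.random.default_rng(0)
n, d, m = 20, 5, 2000                            # samples, input dim, hidden width

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm inputs; no two parallel (w.h.p.)
y = rng.standard_normal(n)

W = rng.standard_normal((m, d))                  # first-layer weights w_r ~ N(0, I), trained
a = rng.choice([-1.0, 1.0], size=m)              # output weights a_r ~ unif{-1, +1}, kept fixed

def predict(W):
    # f(x_i) = (1/sqrt(m)) * sum_r a_r * ReLU(w_r^T x_i)
    return (np.maximum(X @ W.T, 0.0) @ a) / np.sqrt(m)

def gram(W):
    # H(W)_{ij} = (x_i^T x_j / m) * sum_r 1{w_r^T x_i >= 0} * 1{w_r^T x_j >= 0}
    S = (X @ W.T >= 0).astype(float)             # activation patterns, shape (n, m)
    return (X @ X.T) * (S @ S.T) / m

print("min eig of H at init:     ", np.linalg.eigvalsh(gram(W)).min())
print("initial training loss:    ", 0.5 * np.sum((predict(W) - y) ** 2))

eta = 0.05                                       # step size (illustrative)
for _ in range(2000):
    pre = X @ W.T
    err = (np.maximum(pre, 0.0) @ a) / np.sqrt(m) - y        # residual u - y
    # Gradient of (1/2) * ||u - y||^2 w.r.t. W; ReLU subgradient at 0 taken as 0
    grad = ((err[:, None] * (pre > 0) * a[None, :]).T @ X) / np.sqrt(m)
    W -= eta * grad

print("min eig of H after training:", np.linalg.eigvalsh(gram(W)).min())
print("final training loss:        ", 0.5 * np.sum((predict(W) - y) ** 2))
```

Consistent with the paper's analysis, when \( m \) is large the activation patterns, and hence the Gram matrix, change little over training, which is what keeps the minimum eigenvalue, and thus the linear convergence rate, bounded away from zero.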