Gradient Descent Provably Optimizes Over-parameterized Neural Networks


5 Feb 2019 | Simon S. Du*, Xiyu Zhai*, Barnabás Poczos, Aarti Singh
This paper investigates why gradient descent achieves zero training loss for two-layer neural networks with ReLU activation, even though the objective is non-convex and non-smooth. The authors show that when the number of hidden nodes m is sufficiently large and no two input vectors are parallel, gradient descent on the quadratic loss converges to a globally optimal solution at a linear rate. The key insight is that over-parameterization and random initialization keep each weight vector close to its initial value, which lets the analysis exploit a strong-convexity-like property to guarantee linear convergence. The analysis rests on the stability of a Gram matrix derived from the activation patterns of the network. The paper also compares its results with previous work and shows that the required number of hidden nodes is independent of the desired accuracy. The analysis covers both continuous-time (gradient flow) and discrete-time gradient descent, and the results are validated through experiments on synthetic data. The findings suggest that over-parameterization and random initialization play a crucial role in the successful optimization of neural networks by first-order methods.
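For reference, the Gram matrix mentioned above is, following the paper's setup, the infinite-width limit of the matrix induced by the ReLU activation pattern at random initialization:

$$
H^{\infty}_{ij} \;=\; \mathbb{E}_{w \sim \mathcal{N}(0, I)}\!\left[ x_i^{\top} x_j \,\mathbb{1}\{ w^{\top} x_i \ge 0,\; w^{\top} x_j \ge 0 \} \right].
$$

Its least eigenvalue $\lambda_0 = \lambda_{\min}(H^{\infty})$ is strictly positive whenever no two inputs are parallel, and it governs both the required width and the convergence rate.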
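To make the setting concrete, here is a minimal NumPy sketch (not the paper's code) of the regime described above: a two-layer ReLU network f(x) = (1/sqrt(m)) * sum_r a_r ReLU(w_r^T x) with the output signs a_r frozen and only the first layer trained by full-batch gradient descent on the quadratic loss. The sizes n, d, m and the step size are illustrative choices rather than the constants from the theorem; with m much larger than n, the printed training loss should decay roughly geometrically, in line with the linear-rate claim.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; the theorem requires m to be polynomially large in n and 1/lambda_0.
n, d, m = 20, 10, 2000                           # training points, input dimension, hidden width
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm inputs; random directions are almost surely non-parallel
y = rng.normal(size=n)                           # arbitrary labels

W = rng.normal(size=(m, d))                      # first-layer weights w_r ~ N(0, I)
a = rng.choice([-1.0, 1.0], size=m)              # output weights, frozen (only W is trained)

def predict(W):
    # f(x_i) = (1/sqrt(m)) * sum_r a_r * ReLU(w_r^T x_i)
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

# Gram matrix induced by the initial activation pattern; its least eigenvalue is the
# finite-width analogue of lambda_0, which controls the convergence rate in the analysis.
act0 = (X @ W.T > 0.0).astype(float)
H0 = (X @ X.T) * (act0 @ act0.T) / m
print("lambda_min(H(0)) ~", np.linalg.eigvalsh(H0).min())

eta = 0.1                                        # step size (an illustrative choice, not the paper's bound)
for t in range(1001):
    err = predict(W) - y
    if t % 100 == 0:
        print(f"iter {t:4d}   training loss {0.5 * np.sum(err**2):.3e}")
    # Gradient of the squared loss with respect to W (a is not updated).
    act = (X @ W.T > 0.0).astype(float)          # current ReLU activation pattern, shape (n, m)
    grad = (err[:, None] * act * a[None, :]).T @ X / np.sqrt(m)
    W -= eta * grad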