27 May 2019 | Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruosong Wang
This paper provides a detailed analysis of the optimization and generalization of overparameterized two-layer ReLU neural networks trained by gradient descent. The key contributions include:
1. **Training Speed**: A tighter characterization of the speed at which gradient descent fits the training data: convergence is governed by how the label vector projects onto the eigenvectors of a kernel Gram matrix, which explains why training on random labels converges more slowly than training on true labels (see the sketch after this list).
2. **Generalization Bound**: A generalization bound that is independent of the network size, based on a data-dependent complexity measure computed from the labels and the same Gram matrix. This measure empirically distinguishes true labels from random labels on datasets like MNIST and CIFAR (also illustrated in the sketch below).
3. **Learnability**: A broad class of smooth functions can be learned by two-layer ReLU networks trained via gradient descent.
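The sketch below is not from the paper's own code; it is a minimal illustration of the two quantities that contributions 1 and 2 refer to. It assumes unit-norm inputs and uses the paper's closed-form expression for the infinite-width Gram matrix, H^∞_{ij} = x_i^T x_j (π − arccos(x_i^T x_j)) / (2π); the toy dataset, labels, and step size are illustrative choices, not from the paper.

```python
# Sketch (not the paper's code): the NTK Gram matrix H_inf for two-layer ReLU
# networks, the eigen-decomposition that governs the convergence rate
# (contribution 1), and the data-dependent complexity measure
# sqrt(2 * y^T (H_inf)^{-1} y / n) from the generalization bound (contribution 2).
import numpy as np

def ntk_gram(X):
    """H_inf[i, j] = x_i^T x_j * (pi - arccos(x_i^T x_j)) / (2*pi),
    valid for unit-norm inputs x_i (the paper's setting)."""
    G = X @ X.T
    G = np.clip(G, -1.0, 1.0)            # guard arccos against round-off
    return G * (np.pi - np.arccos(G)) / (2 * np.pi)

def residual_norm(H, y, eta, t):
    """Predicted training residual ||y - u(t)||_2 after t GD steps:
    sqrt(sum_i (1 - eta*lam_i)^(2t) * (v_i^T y)^2)."""
    lam, V = np.linalg.eigh(H)
    proj = V.T @ y                       # projections of labels onto eigenvectors
    return np.sqrt(np.sum((1 - eta * lam) ** (2 * t) * proj ** 2))

def complexity(H, y):
    """Data-dependent complexity measure sqrt(2 * y^T H^{-1} y / n)."""
    n = len(y)
    return np.sqrt(2 * y @ np.linalg.solve(H, y) / n)

# Toy comparison (illustrative data): structured labels vs. random labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs
y_struct = np.sign(X[:, 0])                     # labels that depend on the data
y_rand = rng.choice([-1.0, 1.0], size=200)      # random labels

H = ntk_gram(X)
eta = 1.0 / np.linalg.eigvalsh(H).max()         # step size below 1 / lambda_max
for name, y in [("structured", y_struct), ("random", y_rand)]:
    print(name, residual_norm(H, y, eta, t=100), complexity(H, y))
```

On real data the paper reports that the complexity measure computed from true labels is markedly smaller than from random labels on MNIST and CIFAR; the toy data above only illustrates how the quantities are computed.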
The analysis tracks the dynamics of the training process through properties of a related kernel, providing insights into why true labels lead to faster convergence and better generalization. The paper also discusses the limitations of previous work and highlights the practical implications of the findings.
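Concretely, the kernel in question is the Gram matrix H^∞ above. Near random initialization, and assuming (as a simplification here) that the network output at initialization is negligible, the vector of predictions u(t) ∈ ℝⁿ on the n training points evolves, up to lower-order terms, as

$$
u(t+1) - u(t) \approx -\eta\, H^{\infty}\bigl(u(t) - y\bigr)
\quad\Longrightarrow\quad
y - u(t) \approx \bigl(I - \eta H^{\infty}\bigr)^{t} y
= \sum_{i=1}^{n} (1 - \eta \lambda_i)^{t}\,(v_i^{\top} y)\, v_i,
$$

where $(\lambda_i, v_i)$ are the eigenpairs of $H^{\infty}$. Components of $y$ aligned with large-eigenvalue eigenvectors shrink quickly, while components in small-eigenvalue directions (which carry most of the mass for random labels) decay slowly; this is the mechanism behind both the convergence and the generalization results.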