27 May 2019 | Sanjeev Arora*, Simon S. Du†, Wei Hu‡, Zhiyuan Li§, Ruosong Wang♣
This paper presents a fine-grained analysis of the optimization and generalization properties of overparameterized two-layer neural networks trained with gradient descent (GD). The study focuses on a two-layer ReLU network with random initialization and provides several key insights:
1. **Training Speed and Label Type**: The paper shows that training on random labels converges markedly more slowly than training on true labels, and explains why. A tighter characterization of the GD training dynamics reveals that the convergence rate is governed by the projections of the label vector onto the eigenvectors of a Gram matrix derived from the ReLU kernel and the data: labels that put most of their mass on eigenvectors with large eigenvalues are fitted quickly, while random labels spread their mass across directions with small eigenvalues and are fitted slowly (a sketch of the resulting predicted error curve appears after this list).
2. **Generalization Bound**: The paper introduces a data-dependent complexity measure of the labels with respect to the inputs and proves a generalization bound for the network found by GD that depends only on this measure; in particular, the bound is entirely independent of the number of hidden units. Because the measure is computed from the training data alone, it can be evaluated before training, and experiments on MNIST and CIFAR confirm that it is small for true labels and large for random labels (a sketch of the computation appears after this list).
3. **Learnability of Smooth Functions**: Combining the optimization and generalization results, the paper shows that a broad class of smooth label functions is provably learnable by two-layer ReLU networks trained with GD. The class includes linear functions, polynomials of the input of the form (β^T x)^p with p equal to 1 or an even integer, and cos(β^T x); its general form is written out after this list.
4. **Kernel-Based Analysis**: The analysis rests on a fixed kernel associated with the ReLU activation and the random initialization: when the network is sufficiently wide, the GD dynamics stay close to the dynamics induced by the corresponding Gram matrix, and both the convergence rate and the generalization bound are expressed through it. This kernel view is what makes the behavior of GD in the overparameterized regime tractable (a closed-form sketch of the Gram matrix appears directly below).
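The Gram matrix at the heart of the analysis has a closed form when the inputs are normalized to unit length, as assumed in the paper: its (i, j) entry is x_i^T x_j (π - arccos(x_i^T x_j)) / (2π). Below is a minimal NumPy sketch of this closed form; the helper name `relu_ntk_gram` and the random toy data are illustrative, not from the paper.

```python
import numpy as np

def relu_ntk_gram(X):
    """Closed-form Gram matrix H_infty for unit-norm rows of X (n x d):
    H_ij = E_w[ x_i^T x_j * 1{w^T x_i >= 0, w^T x_j >= 0} ]
         = x_i^T x_j * (pi - arccos(x_i^T x_j)) / (2 * pi),  w ~ N(0, I)."""
    G = X @ X.T                        # pairwise inner products
    G = np.clip(G, -1.0, 1.0)          # guard arccos against rounding error
    return G * (np.pi - np.arccos(G)) / (2.0 * np.pi)

# Toy usage with random unit-norm inputs (illustrative data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X /= np.linalg.norm(X, axis=1, keepdims=True)
H = relu_ntk_gram(X)
lam0 = np.linalg.eigvalsh(H).min()     # lambda_0, assumed positive in the analysis
print(H.shape, lam0)
```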
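Concretely, write the eigen-decomposition of this Gram matrix as H^∞ = Σ_i λ_i v_i v_i^T. The paper's tight characterization says that after k steps of GD with step size η the training error is, up to lower-order terms, sqrt(Σ_i (1 - η λ_i)^{2k} (v_i^T y)^2), so the decay along each eigendirection is set by its eigenvalue. The sketch below evaluates this predicted curve for a label vector aligned with a top eigenvector versus random ±1 labels; the data and helper names are illustrative.

```python
import numpy as np

def predicted_residual(H, y, eta, steps):
    """sqrt( sum_i (1 - eta * lam_i)^(2k) * (v_i^T y)^2 ) for k = 0..steps-1,
    the paper's tight characterization of the GD training error
    (up to lower-order terms); requires eta < 1 / max eigenvalue."""
    lam, V = np.linalg.eigh(H)             # eigenpairs of the Gram matrix
    proj = V.T @ y                         # label projections v_i^T y
    k = np.arange(steps)[:, None]
    return np.sqrt(((1.0 - eta * lam) ** (2 * k) * proj ** 2).sum(axis=1))

# Build the closed-form Gram matrix for random unit-norm inputs.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
X /= np.linalg.norm(X, axis=1, keepdims=True)
G = np.clip(X @ X.T, -1.0, 1.0)
H = G * (np.pi - np.arccos(G)) / (2.0 * np.pi)

lam, V = np.linalg.eigh(H)
eta = 0.9 / lam.max()
y_aligned = V[:, -1]                                   # lies along the top eigenvector
y_random = rng.choice([-1.0, 1.0], size=len(lam)) / np.sqrt(len(lam))
print(predicted_residual(H, y_aligned, eta, 500)[-1])  # shrinks quickly
print(predicted_residual(H, y_random, eta, 500)[-1])   # shrinks slowly
```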
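The complexity measure that yields the width-independent generalization bound is sqrt(2 y^T (H^∞)^{-1} y / n); it needs only the inputs and labels, so it can be evaluated before any training. The sketch below compares structured labels against a shuffled copy of the same labels; the linear labeling rule is just an illustrative stand-in for "true" labels.

```python
import numpy as np

def complexity_measure(H, y):
    """The paper's data-dependent measure sqrt( 2 * y^T H^{-1} y / n ),
    which bounds the generalization error up to lower-order terms."""
    return np.sqrt(2.0 * (y @ np.linalg.solve(H, y)) / len(y))

# Toy data: labels from a linear rule vs. the same labels randomly reassigned.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
X /= np.linalg.norm(X, axis=1, keepdims=True)
G = np.clip(X @ X.T, -1.0, 1.0)
H = G * (np.pi - np.arccos(G)) / (2.0 * np.pi)

beta = rng.normal(size=10)
y_true = X @ beta                       # structured labels from a linear rule
y_rand = rng.permutation(y_true)        # same values, random assignment
print(complexity_measure(H, y_true))    # expected: noticeably smaller
print(complexity_measure(H, y_rand))    # expected: larger
```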
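For reference, the learnable class can be written out as below. The displayed bound keeps only the dependence the paper emphasizes and suppresses absolute constants; see the paper for the exact statement.

```latex
% Labels generated by a smooth function of the input,
%   y_i = g(x_i),  g(x) = \sum_j \alpha_j \, (\beta_j^\top x)^{p_j},
% with each p_j equal to 1 or an even integer (cosine labels fit this
% template through the even-power series of \cos), satisfy, up to
% absolute constants,
\[
  \sqrt{\frac{2\, y^\top (H^\infty)^{-1} y}{n}}
  \;\lesssim\;
  \frac{\sum_j p_j \,\lvert \alpha_j \rvert \,\lVert \beta_j \rVert_2^{p_j}}{\sqrt{n}},
\]
% so the generalization error of the GD-trained network on such data
% decays at the 1/sqrt(n) rate, independently of the network width.
```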
The paper also addresses two key questions: (1) why training on true labels converges faster than training on random labels, and (2) how to predict, from the data alone, whether a trained network will generalize. The answers are that the convergence rate is determined by the alignment of the labels with the eigenvectors of the Gram matrix, and that generalization is controlled by a data-dependent complexity measure that can be computed without training the network.
The study contributes to the understanding of why deep neural networks can fit any data and generalize despite being overparameterized, providing theoretical insights that complement recent empirical findings.