On Exact Computation with an Infinitely Wide Neural Net

4 Nov 2019 | Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, Ruosong Wang
The paper "On Exact Computation with an Infinitely Wide Neural Net" by Sanjeev Arora explores the performance of deep neural networks (NNs) when their width, specifically the number of channels in convolutional layers and nodes in fully-connected layers, is allowed to increase to infinity. This exploration is motivated by the quest to understand the theoretical underpinnings of deep learning, particularly in terms of optimization and generalization. The authors introduce the concept of the Neural Tangent Kernel (NTK), which captures the behavior of fully-connected deep NNs in the infinite width limit trained by gradient descent. They extend this idea to convolutional NNs, defining the Convolutional NTK (CNTK). The main contributions of the paper include: 1. **Efficient Algorithm for CNTK**: The authors present an exact and efficient dynamic programming algorithm to compute CNTKs for ReLU activation functions. This algorithm, combined with GPU implementation tricks, allows for the computation of CNTKs on modern datasets like CIFAR-10. 2. **Performance Benchmark**: Using this algorithm, the authors achieve a performance benchmark for pure kernel-based methods on CIFAR-10 that is 10% higher than the best reported performance of a Gaussian process with a fixed kernel and only 6% lower than the performance of the corresponding finite deep net architecture (after disabling batch normalization and data augmentation). 3. **Non-Asymptotic Proof**: The paper provides the first non-asymptotic proof showing that a fully-trained sufficiently wide net is equivalent to the kernel regression predictor using NTK under weaker conditions than previous proofs. The authors also highlight the connection between deep NNs and Gaussian processes, noting that a single hidden-layer NN with i.i.d. random parameters in the limit of infinite width is a function drawn from a Gaussian process. They discuss the limitations of previous work, which often focuses on weakly-trained nets or finite-width NNs, and demonstrate that random feature methods for approximating CNTK do not compute good approximations, as evidenced by their poor performance on CIFAR-10.The paper "On Exact Computation with an Infinitely Wide Neural Net" by Sanjeev Arora explores the performance of deep neural networks (NNs) when their width, specifically the number of channels in convolutional layers and nodes in fully-connected layers, is allowed to increase to infinity. This exploration is motivated by the quest to understand the theoretical underpinnings of deep learning, particularly in terms of optimization and generalization. The authors introduce the concept of the Neural Tangent Kernel (NTK), which captures the behavior of fully-connected deep NNs in the infinite width limit trained by gradient descent. They extend this idea to convolutional NNs, defining the Convolutional NTK (CNTK). The main contributions of the paper include: 1. **Efficient Algorithm for CNTK**: The authors present an exact and efficient dynamic programming algorithm to compute CNTKs for ReLU activation functions. This algorithm, combined with GPU implementation tricks, allows for the computation of CNTKs on modern datasets like CIFAR-10. 2. 
The authors also highlight the connection between deep NNs and Gaussian processes, noting that a single hidden-layer NN with i.i.d. random parameters is, in the infinite-width limit, a function drawn from a Gaussian process. They discuss the limitations of previous work, which often focuses on weakly-trained nets or finite-width NNs, and demonstrate that random-feature methods for approximating the CNTK do not compute good approximations, as evidenced by their poor performance on CIFAR-10.
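Contribution 3 ties a fully trained, sufficiently wide net to the kernel regression predictor built from the NTK. A minimal sketch of that predictor, assuming the (C)NTK Gram matrices have already been computed (function and variable names here are ours), might look like:

```python
import numpy as np

def kernel_regression_predict(K_train, K_test_train, y_train, ridge=0.0):
    """Kernel regression f(x) = Theta(x, X) Theta(X, X)^{-1} y with a
    precomputed (C)NTK Gram matrix.

    K_train:      (n, n) kernel matrix between training inputs.
    K_test_train: (m, n) kernel matrix between test and training inputs.
    y_train:      (n, c) targets, e.g. one-hot labels for CIFAR-10.
    The paper's equivalence statement is for exact (un-regularized) kernel
    regression; `ridge` is only a numerical safeguard, not part of the result.
    """
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + ridge * np.eye(n), y_train)
    return K_test_train @ alpha
```

Class predictions are then taken as the argmax of the returned scores over the label dimension.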