2 Apr 2015 | Dougal Maclaurin, David Duvenaud, Ryan P. Adams
This paper introduces a method for computing exact gradients of cross-validation performance with respect to all hyperparameters by reversing the dynamics of stochastic gradient descent (SGD) with momentum. The method makes it possible to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, regularization schemes, and aspects of neural network architecture. The key idea is to compute hypergradients by running the training procedure backwards, which is feasible because the momentum-based SGD update is invertible. This drastically reduces memory requirements: instead of storing the entire parameter trajectory, only a small number of auxiliary bits per update need to be kept to make the reversal exact in finite-precision arithmetic. The method is applied to optimizing learning rate schedules, weight initialization scales, regularization parameters, and even the training data itself, and the results show that hypergradients can be used to tune complex hyperparameter configurations that improve neural network training. The method has limitations, notably that gradients cannot be computed for discrete hyperparameters and that the validation objective itself can be overfit. The paper also discusses related work and future directions, including applying hypergradients to other learning methods and architectures.
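As a rough illustration of the reversibility the summary describes, the following NumPy sketch runs SGD with momentum forward on a toy quadratic loss and then inverts the updates to recover the initial state without storing the trajectory. This is not the authors' hypergrad code: the loss, schedules, and function names (train_forward, train_reverse) are invented for illustration, and in plain floating point the reversal is only exact up to round-off, which is why the paper stores a few auxiliary bits per step to make it exact.

```python
import numpy as np

def loss_grad(w):
    # Toy quadratic training loss L(w) = 0.5 * ||w||^2, so the gradient is w.
    return w

def train_forward(w0, v0, alphas, mus):
    """SGD with momentum; only the final (w, v) is kept, not the trajectory."""
    w, v = w0.copy(), v0.copy()
    for alpha, mu in zip(alphas, mus):
        v = mu * v - alpha * loss_grad(w)   # momentum update
        w = w + v                           # parameter update
    return w, v

def train_reverse(wT, vT, alphas, mus):
    """Invert the updates step by step to reconstruct earlier states."""
    w, v = wT.copy(), vT.copy()
    for alpha, mu in reversed(list(zip(alphas, mus))):
        w = w - v                             # undo the parameter update
        v = (v + alpha * loss_grad(w)) / mu   # undo the momentum update
    return w, v

rng = np.random.default_rng(0)
w0, v0 = rng.normal(size=5), np.zeros(5)
alphas = np.full(50, 0.1)   # per-step learning rates (hyperparameters)
mus = np.full(50, 0.9)      # per-step momentum decay (hyperparameters)

wT, vT = train_forward(w0, v0, alphas, mus)
w0_rec, v0_rec = train_reverse(wT, vT, alphas, mus)
print(np.max(np.abs(w0 - w0_rec)))  # tiny: reversal is exact up to round-off
```

In the paper, the hypergradients themselves are obtained by back-propagating the validation loss through this reversed trajectory, accumulating derivatives with respect to the learning-rate and momentum schedules and the other hyperparameters along the way.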