30 Nov 2016 | Marcin Andrychowicz, Misha Denil, Sergio Gómez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, Nando de Freitas
This paper casts the design of optimization algorithms, specifically gradient descent update rules, as a learning problem in its own right. The authors propose using Long Short-Term Memory (LSTM) networks to learn update rules that exploit the structure of specific classes of optimization problems: a small LSTM with shared weights is applied to each parameter coordinate and maps that coordinate's gradient to an update. The learned LSTM optimizers outperform hand-designed optimizers on tasks such as simple convex problems, training neural networks, and styling images with neural art, in experiments comparing them against standard optimizers like SGD, RMSprop, ADAM, and NAG. The learned optimizers achieve significant performance improvements and generalize well, even when applied to tasks with architectures or datasets different from those seen during meta-training. The authors also discuss open challenges and potential extensions, such as learning more sophisticated optimizers that take into account correlations between parameters.
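The core mechanism is simple enough to sketch in code. Below is a minimal, illustrative PyTorch sketch of the idea, not the authors' implementation: an LSTM with shared weights is applied coordinatewise, turns the gradient into an update, and is itself meta-trained by backpropagating the summed loss along an unrolled optimization trajectory. All class and function names, the quadratic objective, and the hyperparameters here are illustrative choices; the paper additionally preprocesses gradients and uses a two-layer LSTM.

```python
import torch
import torch.nn as nn

class LSTMOptimizer(nn.Module):
    """Illustrative learned optimizer: one LSTM cell applied coordinatewise.

    Each parameter coordinate is treated as a separate batch element, so the
    same (shared) LSTM weights process every coordinate. Dimensions are
    illustrative, not the paper's exact architecture.
    """
    def __init__(self, hidden_size=20):
        super().__init__()
        self.lstm = nn.LSTMCell(input_size=1, hidden_size=hidden_size)
        self.output = nn.Linear(hidden_size, 1)  # hidden state -> scalar update

    def forward(self, grad, state):
        g = grad.unsqueeze(-1)               # (num_coords,) -> (num_coords, 1)
        h, c = self.lstm(g, state)
        update = self.output(h).squeeze(-1)  # (num_coords,)
        return update, (h, c)

def meta_train_step(opt_net, meta_opt, num_coords=10, hidden=20, unroll=20):
    """One meta-training step on a random quadratic f(theta) = ||A theta - b||^2.

    The summed loss along the unrolled trajectory is minimized with respect to
    the LSTM's weights (a hypothetical setup echoing the paper's quadratic task).
    """
    A = torch.randn(num_coords, num_coords)
    b = torch.randn(num_coords)
    theta = torch.randn(num_coords, requires_grad=True)
    h = torch.zeros(num_coords, hidden)
    c = torch.zeros(num_coords, hidden)
    total_loss = 0.0
    for _ in range(unroll):
        loss = ((A @ theta - b) ** 2).sum()
        total_loss = total_loss + loss
        # create_graph keeps the gradient differentiable so the meta-loss
        # can backpropagate through the optimization trajectory
        grad, = torch.autograd.grad(loss, theta, create_graph=True)
        update, (h, c) = opt_net(grad, (h, c))
        theta = theta + update  # the learned update replaces -lr * grad
    meta_opt.zero_grad()
    total_loss.backward()
    meta_opt.step()
    return total_loss.item()

opt_net = LSTMOptimizer()
meta_opt = torch.optim.Adam(opt_net.parameters(), lr=1e-3)
for step in range(100):
    meta_train_step(opt_net, meta_opt)
```

One design note: this sketch keeps the full second-order dependence of the gradients on the optimizer's weights, whereas the paper deliberately drops those terms for efficiency, treating the optimizee's gradients as constants when differentiating the meta-loss.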