19 Apr 2019 | Sashank J. Reddi, Satyen Kale & Sanjiv Kumar
The paper "On the Convergence of Adam and Beyond" by Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar from Google New York explores the convergence issues of stochastic optimization methods, particularly Adam, RMSprop, ADADELTA, and NADAM, which are widely used in training deep networks. The authors identify that these algorithms often fail to converge to optimal solutions in certain settings, such as learning with large output spaces. They provide a rigorous analysis showing that the exponential moving averages used in these algorithms can cause non-convergence in simple convex optimization problems. Specifically, they demonstrate that the algorithm's learning rate can become indefinite, leading to poor convergence behavior. To address this issue, the authors propose new variants of the ADAM algorithm that incorporate "long-term memory" of past gradients, ensuring guaranteed convergence. These new variants not only fix the convergence issues but also improve empirical performance. The paper includes theoretical proofs and empirical studies to support these findings, highlighting the importance of understanding the theoretical foundations of these popular optimization algorithms.The paper "On the Convergence of Adam and Beyond" by Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar from Google New York explores the convergence issues of stochastic optimization methods, particularly Adam, RMSprop, ADADELTA, and NADAM, which are widely used in training deep networks. The authors identify that these algorithms often fail to converge to optimal solutions in certain settings, such as learning with large output spaces. They provide a rigorous analysis showing that the exponential moving averages used in these algorithms can cause non-convergence in simple convex optimization problems. Specifically, they demonstrate that the algorithm's learning rate can become indefinite, leading to poor convergence behavior. To address this issue, the authors propose new variants of the ADAM algorithm that incorporate "long-term memory" of past gradients, ensuring guaranteed convergence. These new variants not only fix the convergence issues but also improve empirical performance. The paper includes theoretical proofs and empirical studies to support these findings, highlighting the importance of understanding the theoretical foundations of these popular optimization algorithms.