19 Apr 2019 | Sashank J. Reddi, Satyen Kale & Sanjiv Kumar
This paper analyzes the convergence behavior of the ADAM optimization algorithm and identifies a key issue with its exponential moving average mechanism. The authors show that ADAM can fail to converge to an optimal solution in certain convex settings, which contradicts the convergence claims made in previous studies. They demonstrate that the exponential moving average of squared past gradients used in ADAM can cause non-convergence, and they provide a concrete example of a simple convex optimization problem on which ADAM does not converge to the optimal solution. The paper further shows that these convergence issues can be fixed by endowing the algorithm with "long-term memory" of past gradients, and proposes new variants of ADAM that not only resolve the convergence issues but also often improve empirical performance. The authors provide empirical results showing that their proposed variant, AMSGRAD, performs well on both synthetic and real-world datasets. The paper concludes that although ADAM is widely used in practice, its convergence is not guaranteed, and modifications are needed to ensure convergence in certain settings.
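To make the "long-term memory" fix concrete, here is a minimal NumPy sketch contrasting a single ADAM update with the AMSGRAD variant described in the paper. The function names, hyperparameter defaults, and state-passing convention are illustrative assumptions for this sketch, not the API of any particular library.

```python
import numpy as np

def adam_step(x, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update: exponential moving averages of the gradient (m)
    and the squared gradient (v), with bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

def amsgrad_step(x, grad, m, v, v_max, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGRAD update: identical to ADAM except that the denominator
    uses the running maximum of v, giving 'long-term memory' of large
    past gradients."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)   # the key change versus ADAM
    x = x - lr * m / (np.sqrt(v_max) + eps)
    return x, m, v, v_max
```

The only difference is the running maximum: because v_max is non-decreasing, the per-coordinate effective step size can never increase from one iteration to the next, which is the property the paper's convergence analysis relies on and which plain ADAM's exponential moving average does not guarantee.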