ON THE VARIANCE OF THE ADAPTIVE LEARNING RATE AND BEYOND

26 Oct 2021 | Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, Jiawei Han
The paper "On the Variance of the Adaptive Learning Rate and Beyond" by Liyuan Liu explores the effectiveness of the warmup heuristic in stabilizing training, accelerating convergence, and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. The authors identify that the variance of adaptive learning rates is problematic in the early stages of training due to limited data, which can lead to convergence issues. They provide both empirical and theoretical evidence to support this hypothesis and propose Rectified Adam (RAdam), a novel variant of Adam that explicitly reduces the variance of the adaptive learning rate. RAdam is designed to address the issue of large variance in the early stages, which can cause the model to converge to suspicious or bad local optima. Experimental results on various benchmarks, including image classification, language modeling, and neural machine translation, demonstrate the efficacy and robustness of RAdam compared to vanilla Adam and other stabilization techniques. The paper also includes a theoretical analysis to justify the warmup heuristic and the variance reduction mechanism in RAdam.The paper "On the Variance of the Adaptive Learning Rate and Beyond" by Liyuan Liu explores the effectiveness of the warmup heuristic in stabilizing training, accelerating convergence, and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. The authors identify that the variance of adaptive learning rates is problematic in the early stages of training due to limited data, which can lead to convergence issues. They provide both empirical and theoretical evidence to support this hypothesis and propose Rectified Adam (RAdam), a novel variant of Adam that explicitly reduces the variance of the adaptive learning rate. RAdam is designed to address the issue of large variance in the early stages, which can cause the model to converge to suspicious or bad local optima. Experimental results on various benchmarks, including image classification, language modeling, and neural machine translation, demonstrate the efficacy and robustness of RAdam compared to vanilla Adam and other stabilization techniques. The paper also includes a theoretical analysis to justify the warmup heuristic and the variance reduction mechanism in RAdam.