ON THE VARIANCE OF THE ADAPTIVE LEARNING RATE AND BEYOND

26 Oct 2021 | Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, Jiawei Han
This paper investigates the variance of the adaptive learning rate in optimization algorithms such as Adam and its impact on training stability and convergence. The authors propose Rectified Adam (RAdam), a variant of Adam that explicitly rectifies the adaptive learning rate. The key insight is that the variance of the adaptive learning rate is large in the early stages of training, when only a few gradient samples have been observed, which leads to unstable updates and convergence to suboptimal solutions. The paper supports this hypothesis with both empirical and theoretical evidence: the authors analyze the variance of the adaptive learning rate, show that it can be approximated with a scaled inverse chi-square distribution, and show that it decreases as the number of training samples grows.

Building on this analysis, RAdam rectifies the variance of the adaptive learning rate by scaling the update according to the estimated variance. RAdam is shown to improve performance and training stability across tasks including image classification, language modeling, and neural machine translation. The paper also compares RAdam to heuristic warmup strategies and shows that it achieves comparable performance while being more robust and requiring fewer tuned hyperparameters. The authors conclude that the variance of the adaptive learning rate is a fundamental issue affecting adaptive optimization algorithms, and that RAdam provides a principled solution to it.
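To make the rectification concrete, here is a minimal Python sketch of the variance-rectification term that RAdam multiplies into Adam's adaptive step. The summary above does not reproduce the paper's formulas, so this follows the closed-form rectification term from the published RAdam algorithm; the function name and the `step` / `beta2` parameters (Adam's timestep and second-moment decay rate) are illustrative, and this is not the authors' reference implementation.

```python
import math

def radam_rectification(step, beta2=0.999):
    """Sketch of RAdam's variance-rectification factor r_t at a given timestep.

    rho_inf is the maximum length of the approximated simple moving average.
    When rho_t <= 4, the variance of the adaptive learning rate is treated as
    intractable, and RAdam falls back to an un-adapted (momentum-only) update.
    """
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        return None  # adaptive term skipped for this step
    # Rectification term: ratio of the (approximate) variances at rho_t vs. rho_inf
    return math.sqrt(
        ((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
        / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
    )

# Illustrative usage: early steps skip the adaptive term entirely, and the
# factor then grows toward 1 as more gradient samples accumulate.
for t in (1, 5, 10, 100, 10000):
    print(t, radam_rectification(t))
```

Printed over increasing timesteps, the factor starts undefined (adaptive term skipped), then rises toward 1, which mirrors the effect of the heuristic warmup schedules the paper compares against.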