Towards understanding convergence and generalization of AdamW


2024 | Pan ZHOU, Xingyu XIE, Zhouchen LIN, Shuicheng YAN
This paper investigates the convergence and generalization properties of AdamW, an adaptive gradient optimization algorithm that incorporates decoupled weight decay. AdamW modifies the Adam optimizer by decoupling weight decay from the optimization steps, which differs from the traditional $ \ell_{2} $-regularizer that affects the gradient moments. The authors prove that AdamW converges and minimizes a dynamically regularized loss that combines the original loss with a dynamic regularization term induced by the decoupled weight decay. This leads to different behavior compared to Adam and $ \ell_{2} $-regularized Adam ($ \ell_{2} $-Adam). On both general nonconvex problems and PL-conditioned problems, the authors establish the stochastic gradient complexity of AdamW to find a stationary point. This complexity is applicable to Adam and $ \ell_{2} $-Adam and improves their previously known complexity, especially for over-parameterized networks. Additionally, the authors show that AdamW enjoys smaller generalization errors than Adam and $ \ell_{2} $-Adam from the Bayesian posterior perspective. This result explicitly reveals the benefits of decoupled weight decay in AdamW. The paper also compares AdamW with $ \ell_{2} $-Adam, showing that AdamW often achieves better generalization performance. Theoretical analysis and experimental results validate these findings. The experiments demonstrate that AdamW performs well on ImageNet, with significant improvements in generalization compared to $ \ell_{2} $-Adam. The results suggest that the decoupled weight decay in AdamW contributes to better convergence and generalization properties.
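The distinction the paper rests on can be sketched in a few lines of code: $ \ell_{2} $-Adam folds the decay term $ \lambda\theta $ into the gradient, so it is rescaled by the adaptive moment estimates, while AdamW subtracts the decay directly from the parameter after the adaptive step. The sketch below is a minimal single-parameter illustration under assumed hyperparameters (quadratic loss, `lr`, `wd`, step count are all illustrative), not the authors' implementation or analysis setup.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999,
              eps=1e-8, wd=0.0, decoupled=False):
    """One Adam update on a scalar parameter.

    decoupled=False with wd > 0 gives l2-regularized Adam: the decay
    term enters the gradient, and hence the moment estimates.
    decoupled=True gives AdamW: the decay is applied directly to
    theta and never touches the moments.
    """
    if not decoupled:
        grad = grad + wd * theta           # l2 penalty folded into the gradient
    m = b1 * m + (1 - b1) * grad           # first-moment estimate
    v = b2 * v + (1 - b2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        theta = theta - lr * wd * theta    # decoupled weight decay (AdamW)
    return theta, m, v

def run(decoupled, wd=0.1):
    """Minimize f(theta) = (theta - 3)^2, with weight decay pulling toward 0."""
    theta, m, v = 5.0, 0.0, 0.0
    for t in range(1, 501):
        grad = 2.0 * (theta - 3.0)
        theta, m, v = adam_step(theta, grad, m, v, t, wd=wd, decoupled=decoupled)
    return theta

print(run(decoupled=False))  # l2-Adam
print(run(decoupled=True))   # AdamW
```

With `wd=0` the two variants coincide exactly; with `wd > 0` they trace different trajectories, which is the starting point for the paper's observation that AdamW minimizes a dynamically regularized loss rather than the fixed $ \ell_{2} $-penalized one.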