Implicit Bias of AdamW: ℓ∞ Norm Constrained Optimization

5 Apr 2024 | Shuo Xie, Zhiyuan Li
This paper investigates the implicit bias of AdamW, a variant of the Adam optimizer with decoupled weight decay, in the context of $ \ell_{\infty} $-norm constrained optimization. The authors show that AdamW implicitly performs constrained optimization by converging to KKT points of the constrained problem $ \min_{\|x\|_{\infty} \leq \frac{1}{\lambda}} L(x) $, where $ \lambda $ is the weight decay factor. This result is derived by analyzing AdamW in the full-batch setting and showing that it asymptotically behaves like SignGD with weight decay, which is normalized steepest descent with respect to the $ \ell_{\infty} $-norm. The analysis also reveals that AdamW's implicit bias is closely tied to the geometry of the loss function and the choice of norm. The paper further provides convergence guarantees for AdamW in both convex and non-convex settings, and demonstrates that the $ \ell_{\infty} $-norm can yield better optimization performance than the $ \ell_{2} $-norm. The results are supported by experiments on language modeling tasks and synthetic problems, showing that AdamW's implicit bias leads to better convergence properties and performance. The paper also highlights the importance of understanding the implicit regularization of optimization algorithms and the role of weight decay in shaping the optimization trajectory.
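To make the connection concrete, here is a minimal sketch (not from the paper) contrasting the AdamW update with the SignGD-with-weight-decay update it asymptotically tracks; the function names, the toy quadratic loss, and the hyperparameter values are illustrative assumptions. The comment on the SignGD step records the key observation: with step size $\eta\lambda \leq 1$, the iterates remain inside the $\ell_{\infty}$ ball of radius $1/\lambda$, the feasible set of the constrained problem above.

```python
import numpy as np

def adamw_step(x, grad, m, v, t, lr=1e-3, wd=0.1,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW step: Adam update plus decoupled weight decay wd * x."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    x = x - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * x)
    return x, m, v

def signgd_wd_step(x, grad, lr=1e-3, wd=0.1):
    """SignGD with decoupled weight decay: normalized steepest descent
    w.r.t. the l_inf norm. If lr * wd <= 1 and ||x||_inf <= 1/wd, then
    |x_new_i| <= (1 - lr*wd)*|x_i| + lr <= 1/wd, so the l_inf ball of
    radius 1/wd is invariant."""
    return x - lr * (np.sign(grad) + wd * x)

# Toy demo on a quadratic loss L(x) = 0.5 * ||x - target||^2 (illustrative only).
target = np.array([5.0, -3.0, 0.2])
x_adamw = np.zeros(3); m = np.zeros(3); v = np.zeros(3)
x_sign = np.zeros(3)
for t in range(1, 5001):
    x_adamw, m, v = adamw_step(x_adamw, x_adamw - target, m, v, t)
    x_sign = signgd_wd_step(x_sign, x_sign - target)

# With wd = 0.1 the feasible set is ||x||_inf <= 10; both trajectories stay inside it.
print(np.max(np.abs(x_adamw)), np.max(np.abs(x_sign)))
```

In this toy run, coordinates of the target that lie inside the ball (like 0.2) are reached, while large coordinates would be clipped toward the boundary $1/\lambda$, mirroring the KKT characterization described above.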