This paper explores the implicit bias of AdamW, a popular optimization algorithm in deep learning, particularly for language-model training. The authors aim to understand why AdamW outperforms Adam with $\ell_2$ regularization in both optimization and generalization. They show that AdamW implicitly performs constrained optimization: its iterates are driven toward the region where the $\ell_\infty$ norm of the parameters is bounded by the inverse of the weight decay factor $\lambda$. The main result, Theorem 1.1, states that if AdamW converges under a non-increasing learning rate schedule whose partial sums diverge, it must converge to a KKT point of the problem of minimizing the original loss subject to this $\ell_\infty$ constraint. The proof builds on the observation that Adam can be viewed as a smoothed version of SignGD, which is normalized steepest descent with respect to the $\ell_\infty$ norm. The paper also provides a convergence analysis of normalized steepest descent with weight decay for convex losses and discusses how the limiting $\ell_\infty$ norm relates to the hyperparameters. Experiments support the theory, showing that the $\ell_\infty$ norm of AdamW's iterates is bounded by $\frac{1}{\lambda}$ under suitable conditions. Overall, the work advances the understanding of AdamW's implicit regularization and offers insight into its advantage over Adam with $\ell_2$ regularization.
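To make the constrained-optimization intuition concrete, here is a minimal sketch (not the authors' code) of SignGD with decoupled, AdamW-style weight decay on a toy quadratic objective; the function name, loss, and hyperparameter values are illustrative assumptions. Once $\|x\|_\infty \le 1/\lambda$ and $\eta\lambda \le 1$, each step keeps the iterate inside that $\ell_\infty$ ball, since $|x_{\text{new}}| \le (1-\eta\lambda)/\lambda + \eta = 1/\lambda$ coordinatewise.

```python
import numpy as np

def signgd_weight_decay_step(x, grad, lr, lam):
    """One SignGD step with decoupled (AdamW-style) weight decay:
    x_new = (1 - lr*lam) * x - lr * sign(grad)."""
    return (1.0 - lr * lam) * x - lr * np.sign(grad)

# Toy quadratic loss L(x) = 0.5 * ||x - target||^2 as a stand-in objective.
rng = np.random.default_rng(0)
target = 5.0 * rng.normal(size=10)
x = np.zeros(10)
lam, lr = 0.1, 0.05  # hypothetical hyperparameters for illustration

for t in range(500):
    grad = x - target
    x = signgd_weight_decay_step(x, grad, lr, lam)

# The final iterate's l_inf norm stays within the 1/lambda ball.
print("l_inf norm of final iterate:", np.max(np.abs(x)))
print("1 / lambda bound:           ", 1.0 / lam)
```

Running this sketch, the coordinates of $x$ settle near the coordinates of the target clipped to $[-1/\lambda, 1/\lambda]$, mirroring the paper's picture of AdamW as (smoothed) normalized steepest descent under an $\ell_\infty$ constraint.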