Why Transformers Need Adam: A Hessian Perspective

24 Jun 2024 | Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun*, Zhi-Quan Luo
The paper investigates why Stochastic Gradient Descent (SGD) performs markedly worse than Adam when training Transformers. Examining the Hessian spectrum of Transformers, the authors identify a phenomenon they call "block heterogeneity": the Hessian spectrum varies substantially across the different parameter blocks of a Transformer, whereas no such variation appears in Convolutional Neural Networks (CNNs). They show, both empirically and theoretically, that SGD underperforms Adam on problems with block heterogeneity but performs comparably on problems without it: SGD's single learning rate, shared across all blocks, cannot accommodate the disparate curvature, whereas Adam's coordinate-wise learning rates can. The paper also discusses the implications of these findings for optimizing large models and proposes a quantitative metric that predicts the performance gap between SGD and Adam.
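As a rough illustration of this point (a toy sketch, not an experiment from the paper), the code below compares gradient descent with one global step size against gradient descent with per-block step sizes on a diagonal quadratic whose curvature differs sharply across two parameter blocks. The per-block step sizes stand in for the coordinate-wise scaling that Adam's second-moment estimate effectively provides; the curvatures, learning rates, and step counts are illustrative assumptions.

```python
import numpy as np

# Two blocks of 5 parameters each with very different curvature ("block heterogeneity").
h = np.concatenate([np.full(5, 1.0), np.full(5, 100.0)])

def loss(x):
    # Diagonal quadratic loss 0.5 * x^T H x.
    return 0.5 * np.sum(h * x**2)

def gd_single_lr(lr, steps=100):
    """Gradient descent with one global learning rate, as in SGD."""
    x = np.ones_like(h)
    for _ in range(steps):
        x -= lr * (h * x)          # gradient of the quadratic is H x
    return loss(x)

def gd_blockwise_lr(steps=100):
    """Gradient descent with per-coordinate step sizes ~ 1/curvature,
    mimicking the effect of Adam's coordinate-wise scaling (illustrative)."""
    x = np.ones_like(h)
    lr = 0.5 / h                   # each block gets a step size matched to its curvature
    for _ in range(steps):
        x -= lr * (h * x)
    return loss(x)

# The largest stable single learning rate is 2 / max(h) = 0.02, so the flat
# block (curvature 1) converges very slowly; block-wise step sizes handle
# both blocks at once and drive the loss to near zero in the same budget.
print("single lr = 0.019 :", gd_single_lr(0.019))
print("block-wise lrs    :", gd_blockwise_lr())
```

Under these assumptions, the single learning rate is constrained by the stiffest block and leaves the flat block barely optimized after the step budget, while the block-wise step sizes converge on both blocks, mirroring the SGD-versus-Adam gap the paper attributes to block heterogeneity.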