24 Jun 2024 | Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhi-Quan Luo
This paper investigates why stochastic gradient descent (SGD) performs worse than Adam on Transformers. The authors attribute the gap to "block heterogeneity" of the Hessian matrix: the Hessian spectra of different parameter blocks in a Transformer differ widely from one another. Unlike CNNs, which are built by repeatedly stacking similar parameter blocks, Transformers stack disparate parameter blocks (attention, MLP, and embedding layers), which leads to large differences in their blockwise Hessian spectra. This heterogeneity hampers SGD, which applies a single learning rate to all blocks, whereas Adam's coordinate-wise learning rates can adapt to it. The authors validate the hypothesis both empirically and theoretically: SGD performs worse on problems with block heterogeneity, block heterogeneity is a key factor in the performance gap between SGD and Adam on Transformers, and pre-trained Transformers with reduced block heterogeneity let SGD perform better. On quadratic models, they further prove that Adam can achieve better convergence rates than SGD when the Hessian is block-heterogeneous. They conclude that block heterogeneity is an important driver of the SGD-Adam gap on Transformers and that the choice of optimizer should be guided by the blockwise Hessian spectrum.
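
To make the single-learning-rate intuition concrete, here is a small NumPy sketch (not from the paper): a block-diagonal quadratic whose two blocks have very different eigenvalue ranges stands in for a block-heterogeneous Hessian, and plain gradient descent is compared against a textbook Adam update. The block sizes, spectra, step counts, and learning rates are all illustrative choices, not the paper's settings.

```python
# Toy illustration (not the paper's experiments): a block-diagonal quadratic
# loss L(w) = 0.5 * w^T H w whose two blocks have very different spectra,
# standing in for "block heterogeneity". We compare gradient descent, which
# must pick one learning rate for everything, with a standard Adam update
# that rescales each coordinate by a running estimate of its squared gradient.
import numpy as np

rng = np.random.default_rng(0)

def make_block(dim, eig_low, eig_high):
    """Random symmetric PSD block with eigenvalues spread over [eig_low, eig_high]."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q @ np.diag(np.linspace(eig_low, eig_high, dim)) @ q.T

# Two parameter "blocks" with very different curvature scales.
H = np.zeros((40, 40))
H[:20, :20] = make_block(20, 1e-3, 1e-1)   # flat block
H[20:, 20:] = make_block(20, 1.0, 100.0)   # sharp block

def run_gd(lr, steps=500):
    w = np.ones(40)
    for _ in range(steps):
        w -= lr * (H @ w)                  # one learning rate for all blocks
    return w

def run_adam(lr, steps=500, beta1=0.9, beta2=0.999, eps=1e-8):
    w = np.ones(40)
    m, v = np.zeros(40), np.zeros(40)
    for t in range(1, steps + 1):
        g = H @ w
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)  # coordinate-wise step sizes
    return w

def report(name, w):
    print(f"{name}: flat-block residual {np.linalg.norm(w[:20]):.3f}, "
          f"sharp-block residual {np.linalg.norm(w[20:]):.3f}")

# GD's learning rate is capped by the sharpest block (roughly 2 / lambda_max).
report("GD  ", run_gd(lr=0.018))
report("Adam", run_adam(lr=0.01))
```

Under these settings, gradient descent leaves the flat block almost untouched because its step size is throttled by the sharp block, while Adam drives both blocks toward the minimizer; this mirrors the argument for why a single learning rate cannot serve heterogeneous blocks.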
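
The closing recommendation, letting the blockwise Hessian spectrum guide the optimizer choice, can be probed without forming any Hessian. The sketch below is an illustrative stand-in, not the paper's estimator (which computes full blockwise spectra): it uses autograd Hessian-vector products and power iteration to estimate the dominant Hessian eigenvalue of each parameter block of a placeholder PyTorch model. The model, data, and iteration counts are assumptions made for the example.

```python
# A rough probe of per-block curvature (not the paper's method): for each
# parameter block, estimate the largest-magnitude Hessian eigenvalue with
# power iteration on Hessian-vector products from autograd.
import torch

def top_hessian_eig(loss, param, iters=50):
    """Estimate the dominant Hessian eigenvalue restricted to one parameter block."""
    v = torch.randn_like(param)
    v /= v.norm()
    # First-order gradient with create_graph=True so we can differentiate again.
    grad = torch.autograd.grad(loss, param, create_graph=True)[0]
    eig = torch.tensor(0.0)
    for _ in range(iters):
        # Hessian-vector product: d(grad . v)/d(param) = H v, without forming H.
        hv = torch.autograd.grad(grad, param, grad_outputs=v, retain_graph=True)[0]
        eig = torch.dot(hv.flatten(), v.flatten())   # Rayleigh quotient v^T H v
        v = hv / (hv.norm() + 1e-12)
    return eig.item()

# Placeholder model and batch, just to make the sketch runnable.
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.Tanh(), torch.nn.Linear(32, 4))
x, y = torch.randn(64, 16), torch.randint(0, 4, (64,))
loss = torch.nn.functional.cross_entropy(model(x), y)

# Large disparities in curvature scale across blocks are a crude proxy for
# the block heterogeneity the paper measures with full blockwise spectra.
for name, p in model.named_parameters():
    print(f"{name:12s} dominant Hessian eigenvalue ~ {top_hessian_eig(loss, p):.3e}")
```

A fuller diagnosis would estimate the whole spectrum of each block (e.g., with Lanczos-based methods), but even top-eigenvalue disparities spanning orders of magnitude hint at the kind of heterogeneity the paper describes.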