The paper explores why Stochastic Gradient Descent (SGD) performs significantly worse than Adam on Transformers. The authors examine the Hessian spectrum of Transformers and identify a phenomenon they call "block heterogeneity": the Hessian spectrum varies significantly across the different parameter blocks of a Transformer, whereas no such heterogeneity appears in Convolutional Neural Networks (CNNs). They find that SGD underperforms Adam on problems with block heterogeneity but performs comparably on problems without it. Theoretical and empirical evidence supports this explanation: SGD's single learning rate, shared across all blocks, cannot accommodate the heterogeneity, whereas Adam's coordinate-wise learning rates can. The paper also discusses the implications of these findings for optimizing large models and proposes a quantitative metric to predict the performance gap between SGD and Adam.
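The core argument can be illustrated with a toy example that is not taken from the paper: a block-diagonal quadratic loss whose two parameter blocks have very different curvature, mimicking block heterogeneity. The curvature values (100 and 0.01), block sizes, and function names `run_gd` and `run_blockwise` below are illustrative assumptions, and the per-block learning rate is a simplified stand-in for Adam's coordinate-wise adaptivity (real Adam also uses momentum and running second-moment estimates). The sketch only captures the shared-rate versus block-adapted-rate distinction.

```python
# Toy sketch (not the paper's experiments): a block-diagonal quadratic whose
# two blocks have very different curvature, mimicking "block heterogeneity".
# A single shared learning rate must stay below 2 / h_max for stability, so
# the flat block makes almost no progress; a rate matched to each block's own
# curvature (a simplified stand-in for Adam's coordinate-wise adaptivity)
# removes that bottleneck.
import numpy as np

# Per-coordinate curvatures: block 0 is stiff (h = 100), block 1 is flat (h = 0.01).
h = np.concatenate([np.full(5, 100.0), np.full(5, 0.01)])

def loss(x):
    return 0.5 * np.sum(h * x ** 2)

def grad(x):
    return h * x

def run_gd(x0, lr, steps):
    """Gradient descent with one learning rate shared by all blocks (noiseless SGD)."""
    x = x0.copy()
    for _ in range(steps):
        x = x - lr * grad(x)
    return loss(x)

def run_blockwise(x0, steps):
    """Gradient descent with a per-coordinate learning rate of 0.5 / h."""
    x = x0.copy()
    lr = 0.5 / h  # each coordinate's rate is matched to its own curvature
    for _ in range(steps):
        x = x - lr * grad(x)
    return loss(x)

x0 = np.ones_like(h)
# Shared-rate GD: stability requires lr < 2 / 100 = 0.02, so the flat block
# (h = 0.01) shrinks by only a factor of (1 - 0.019 * 0.01) per step.
print("single-rate GD  :", run_gd(x0, lr=0.019, steps=1000))
# Block-adapted GD: every coordinate contracts by a factor of 0.5 per step.
print("block-adapted GD:", run_blockwise(x0, steps=1000))
```

With these (assumed) settings, the shared-rate run is stuck near a loss of about 0.017 because its step size is capped by the stiffest block, while the block-adapted run drives the loss to numerical zero, which mirrors the paper's claim that a single learning rate cannot serve blocks with widely differing Hessian spectra.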