29 Jun 2020 | Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu
This paper investigates the importance of the learning rate warm-up stage in training the Transformer architecture and shows that the placement of layer normalization significantly affects training stability. The authors demonstrate theoretically that in the original Post-LN Transformer, which places layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large at initialization; training with a large learning rate is therefore unstable, which makes the warm-up stage essential. In contrast, the Pre-LN Transformer, which places layer normalization inside the residual blocks, has well-behaved gradients at initialization, allowing the warm-up stage to be safely removed. The theoretical analysis, carried out with mean field theory, shows that the gradient scales of the two architectures differ, with the Pre-LN Transformer having smaller gradients near the output layers at initialization. Experiments on machine translation and unsupervised pre-training support these findings: Pre-LN Transformers trained without warm-up achieve performance comparable to Post-LN Transformers while converging faster and requiring significantly less training time and hyperparameter tuning. The study highlights the importance of layer normalization placement in Transformer training and suggests that the warm-up stage is not always necessary for Pre-LN Transformers.
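To make the architectural difference concrete, here is a minimal PyTorch-style sketch (not the authors' code; the module names, dimensions, and use of nn.MultiheadAttention are illustrative assumptions) of one encoder block under each placement: Post-LN normalizes after the residual addition, while Pre-LN normalizes inside the residual branch.

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Original Transformer encoder block: LayerNorm is applied *after*
    the residual addition, i.e. between the residual blocks."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x)[0])   # residual, then LayerNorm
        x = self.ln2(x + self.ffn(x))             # residual, then LayerNorm
        return x

class PreLNBlock(nn.Module):
    """Pre-LN variant: LayerNorm is applied *inside* the residual branch,
    before attention / FFN, leaving the residual path untouched."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]             # LayerNorm inside the branch
        x = x + self.ffn(self.ln2(x))
        return x

# Usage example (shapes are illustrative): batch of 8, sequence length 16, d_model 512.
x = torch.randn(8, 16, 512)
y = PreLNBlock(d_model=512, n_heads=8, d_ff=2048)(x)
```

For reference, the warm-up schedule in question ramps the learning rate up linearly over the first few thousand steps, roughly lr(t) = (t / T_warmup) * lr_max for t <= T_warmup; removing it, as the paper argues is safe for Pre-LN, simply means starting from the target learning rate at step one.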