29 Jun 2020 | Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu
This paper investigates whether the learning rate warm-up stage is necessary when training Transformers, focusing on how the placement of layer normalization affects optimization. The authors theoretically analyze the gradients at initialization for two Transformer variants: Post-Layer Normalization (Post-LN) and Pre-Layer Normalization (Pre-LN). They prove that in Post-LN Transformers, the expected gradients of the parameters near the output layer are large at initialization, so training is unstable unless a warm-up stage with a small initial learning rate is used. In contrast, Pre-LN Transformers have well-behaved gradients at initialization, allowing the warm-up stage to be removed. Empirical experiments on machine translation tasks and unsupervised pre-training (BERT) show that Pre-LN Transformers reach comparable performance with significantly less training time and hyper-parameter tuning. The findings suggest that the placement of layer normalization substantially affects the optimization process in Transformers.
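For concreteness, here is a minimal PyTorch-style sketch of the two layer arrangements discussed in the paper. The module names and dimensions are illustrative assumptions, not taken from the authors' code; the point is only where `LayerNorm` sits relative to the residual connection.

```python
import torch.nn as nn


class PostLNBlock(nn.Module):
    """Post-LN (original Transformer): LayerNorm is applied after the residual addition."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x)[0])  # normalize the sum of input and sublayer output
        x = self.ln2(x + self.ffn(x))
        return x


class PreLNBlock(nn.Module):
    """Pre-LN: LayerNorm sits inside the residual branch, before each sublayer."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]            # identity path stays un-normalized
        x = x + self.ffn(self.ln2(x))
        return x
```

In the Post-LN block every residual output passes through a normalization layer, which is what the analysis ties to large gradients near the output at initialization; in the Pre-LN block the identity path is left untouched, which is the property the paper associates with well-behaved gradients and warm-up-free training.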