GLU Variants Improve Transformer

February 14, 2020 | Noam Shazeer
The paper "GLU Variants Improve Transformer" by Noam Shazeer from Google explores the use of gated linear units (GLU) and their variants in the feed-forward sublayers of the Transformer model. GLU, introduced by Dauphin et al. (2016), consists of the component-wise product of two linear projections, one of which is passed through a sigmoid function. The authors test various GLU variants, including ReGLU, GEGLU, and SwiGLU, using different nonlinear functions such as ReLU, GELU, and Swish. In the experiments, the FFN layers in the Transformer model are replaced with these GLU variants. The models are trained on the Text-to-Text Transfer Transformer (T5) setup, which involves pre-training on a denoising objective and fine-tuning on language understanding tasks. The results show that the GLU variants, particularly GEGLU and SwiGLU, produce better perplexities during pre-training and achieve superior performance on downstream tasks compared to the standard ReLU activation function. The paper concludes that these GLU variants are simple to implement and do not introduce significant computational overhead, suggesting that they could be a valuable addition to the Transformer architecture.The paper "GLU Variants Improve Transformer" by Noam Shazeer from Google explores the use of gated linear units (GLU) and their variants in the feed-forward sublayers of the Transformer model. GLU, introduced by Dauphin et al. (2016), consists of the component-wise product of two linear projections, one of which is passed through a sigmoid function. The authors test various GLU variants, including ReGLU, GEGLU, and SwiGLU, using different nonlinear functions such as ReLU, GELU, and Swish. In the experiments, the FFN layers in the Transformer model are replaced with these GLU variants. The models are trained on the Text-to-Text Transfer Transformer (T5) setup, which involves pre-training on a denoising objective and fine-tuning on language understanding tasks. The results show that the GLU variants, particularly GEGLU and SwiGLU, produce better perplexities during pre-training and achieve superior performance on downstream tasks compared to the standard ReLU activation function. The paper concludes that these GLU variants are simple to implement and do not introduce significant computational overhead, suggesting that they could be a valuable addition to the Transformer architecture.