GLU Variants Improve Transformer

February 14, 2020 | Noam Shazeer
Noam Shazeer of Google presents a study on improving the Transformer by using variations of the Gated Linear Unit (GLU). A GLU is a neural network layer that takes the component-wise product of two linear transformations of its input, one of which is passed through a sigmoid. The paper explores nonlinear and linear alternatives to that sigmoid and tests the resulting GLU variants in the feed-forward sublayers of the Transformer, where several of them outperform the standard ReLU and GELU activations.

The Transformer's position-wise feed-forward network (FFN) applies two linear transformations to each position, with a ReLU activation in between. Earlier work has proposed swapping the ReLU for other activations such as GELU and Swish. The paper instead introduces GLU variants, including ReGLU, GEGLU, and SwiGLU, which use ReLU, GELU, and Swish in place of the sigmoid gate (a Bilinear variant omits the activation entirely). In the modified FFN, the first linear transformation and its activation are replaced by a GLU layer with two input projections; because this adds a third weight matrix, the hidden dimension is reduced so that parameter count and computation stay comparable to the baseline.

The study tests these variants on the Text-to-Text Transfer Transformer (T5), pre-training on the C4 dataset and fine-tuning on a range of language understanding tasks. The GEGLU and SwiGLU variants achieve the best perplexity on the segment-filling pre-training objective, and after fine-tuning they also perform better on many downstream tasks across the GLUE, SuperGLUE, and SQuAD benchmarks. The paper concludes that GLU variants improve model quality without significant computational overhead, while offering no explanation for why they work, attributing their success, tongue in cheek, to "divine benevolence."
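To make the layer definitions concrete, here is a minimal NumPy sketch (not the authors' code) of the baseline FFN and the SwiGLU variant. The weight names W, V, and W2 mirror the paper's notation, biases are omitted as in the paper's experiments, and Swish is taken with beta = 1 (i.e., SiLU).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def swish(x):
    # Swish / SiLU with beta = 1: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def ffn_relu(x, W1, W2):
    # Baseline Transformer FFN: two linear maps with a ReLU in between.
    # x: (..., d_model), W1: (d_model, d_ff), W2: (d_ff, d_model)
    return relu(x @ W1) @ W2

def ffn_swiglu(x, W, V, W2):
    # SwiGLU FFN: the first linear map and its activation are replaced by a
    # gated layer -- the component-wise product of a Swish-activated
    # projection and a second, ungated projection.
    # x: (..., d_model), W and V: (d_model, d_ff), W2: (d_ff, d_model)
    return (swish(x @ W) * (x @ V)) @ W2
```

Because the gated variant carries an extra projection matrix V, the paper shrinks d_ff (to two-thirds of the baseline's) so that total parameters and FLOPs stay roughly constant; the sketch leaves that sizing to the caller.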