The Gaussian Error Linear Unit (GELU) is a neural network activation function that outperforms ReLU and ELU across a range of tasks. GELU is defined as xΦ(x), where Φ(x) is the standard Gaussian cumulative distribution function. Unlike ReLU, which gates inputs by their sign, GELU weights inputs by their value. The GELU nonlinearity is related to stochastic regularizers and can be viewed as the expectation of a modification to Adaptive Dropout. Models with GELUs are shown to match or exceed models with ReLUs or ELUs on tasks in computer vision, natural language processing, and speech recognition.
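As a concrete illustration (not code from the paper), the exact GELU can be computed directly from the Gaussian CDF via the error function, since Φ(x) = 0.5(1 + erf(x/√2)); the snippet below is a minimal plain-Python sketch.

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Large positive inputs pass through almost unchanged, large negative
# inputs are pushed toward zero, and the transition between the two is smooth.
print(gelu(2.0))   # ~ 1.954
print(gelu(-2.0))  # ~ -0.045
print(gelu(0.0))   # 0.0
```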
The GELU is motivated by combining properties from dropout, zoneout, and ReLUs. It is defined as the expected transformation of a stochastic regularizer on an input x, which is xΦ(x). The GELU can be approximated with different formulas, including 0.5x(1 + tanh[√(2/π)(x + 0.044715x³)]) or xσ(1.702x). The GELU is also shown to be more robust to noisy inputs than ReLU and ELU, and it performs well alongside dropout.
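The sketch below (helper names are mine, not from the paper) compares the exact form with the two approximations; the tanh-based formula tracks the exact GELU very closely, while xσ(1.702x) trades a little accuracy for simplicity.

```python
import math

def gelu_exact(x):
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def gelu_sigmoid(x):
    # x * sigmoid(1.702 * x)
    return x / (1.0 + math.exp(-1.702 * x))

for x in (-3.0, -1.0, 0.5, 2.0):
    print(f"{x:+.1f}  exact={gelu_exact(x):+.4f}  "
          f"tanh={gelu_tanh(x):+.4f}  sigmoid={gelu_sigmoid(x):+.4f}")
```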
Experiments on MNIST classification, MNIST autoencoding, Twitter POS tagging, TIMIT frame classification, and CIFAR-10/100 classification show that GELU outperforms ReLU and ELU. In MNIST classification, GELU achieves lower training log loss and higher test accuracy; in MNIST autoencoding, it outperforms the other nonlinearities; and in Twitter POS tagging, TIMIT frame classification, and CIFAR-10/100 classification, it achieves the lowest test error.
The GELU is a non-convex, non-monotonic function that is not linear in the positive domain and exhibits curvature at all points, whereas ReLU and ELU are convex, monotonic, and linear in the positive domain. Unlike ReLU, the GELU can also output negative values. The GELU has a probabilistic interpretation, as it is the expectation of a stochastic regularizer. Experiments further show that GELU works well with a variety of optimizers and learning rates.
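The non-monotonicity is easy to verify numerically; the short sketch below (my own check, not a result from the paper) scans the negative region and finds a single minimum of roughly -0.17 near x ≈ -0.75.

```python
import math

def gelu(x):
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Scan the negative region on a fine grid and locate the minimum;
# the function dips below zero and then rises back toward zero.
xs = [i / 1000.0 for i in range(-3000, 1)]
x_min = min(xs, key=gelu)
print(x_min, gelu(x_min))  # roughly x = -0.75, GELU(x) = -0.17
```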
The GELU has been widely adopted in state-of-the-art Transformers, becoming a default activation function, and subsequent work on closely related nonlinearities has credited it as the original source of the idea. The GELU is a viable alternative to previous nonlinearities and has shown consistent performance across various tasks.
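As a usage illustration (a hypothetical sketch, not tied to any particular Transformer implementation), the position-wise feed-forward block in many Transformer variants simply swaps ReLU for GELU; assuming PyTorch and its built-in nn.GELU, that looks like the following.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feed-forward block using GELU.

    Hypothetical sketch; the dimensions (512 -> 2048 -> 512) are
    illustrative defaults, not values from the GELU paper.
    """

    def __init__(self, d_model: int = 512, d_hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),  # GELU in place of the original Transformer's ReLU
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

ffn = FeedForward()
out = ffn(torch.randn(2, 10, 512))  # (batch, sequence length, d_model)
print(out.shape)                    # torch.Size([2, 10, 512])
```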