The paper introduces the Gaussian Error Linear Unit (GELU), a new activation function for neural networks, defined as \( x \Phi(x) \), where \( \Phi(x) \) is the standard Gaussian cumulative distribution function. Unlike ReLU, which gates inputs by their sign, GELU weights inputs by their value. The authors evaluate GELU against ReLU and ELU on computer vision, natural language processing, and speech recognition tasks, and find that GELU performs at least as well as, and often better than, these other activations. The GELU nonlinearity is motivated by combining properties of dropout, zoneout, and ReLU: it has a probabilistic interpretation as the expectation of a stochastic regularizer that multiplies the input by zero or one, keeping the input with probability \( \Phi(x) \). The paper also relates GELU to other nonlinearities, noting that it can be viewed as a smoother version of ReLU, and shows that it is more robust to noisy inputs and performs well across different learning rates. The authors provide practical tips for using GELU and conclude that it is a viable alternative to previous nonlinearities.
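
To make the definition concrete, here is a minimal NumPy/SciPy sketch (my own illustration, not code from the paper) of the exact GELU \( x \Phi(x) \) alongside the tanh and sigmoid approximations the paper reports; the function names are arbitrary.

```python
import numpy as np
from scipy.special import erf  # vectorized error function


def gelu_exact(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF,
    written via the identity Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))."""
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation reported in the paper:
    0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x**3)))."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def gelu_sigmoid(x):
    """Sigmoid approximation reported in the paper: x * sigmoid(1.702 * x)."""
    return x / (1.0 + np.exp(-1.702 * x))


if __name__ == "__main__":
    xs = np.linspace(-3.0, 3.0, 7)
    print(np.round(gelu_exact(xs), 4))
    print(np.round(gelu_tanh(xs), 4))     # very close to the exact values
    print(np.round(gelu_sigmoid(xs), 4))  # looser, cheaper approximation
```

For small negative inputs the output is slightly negative rather than exactly zero, which is what distinguishes the smooth, value-weighted gating of GELU from the hard sign-based gating of ReLU.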