21 Jul 2016 | Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
The paper introduces layer normalization as a method to speed up the training of neural networks by normalizing the summed inputs to the neurons within a layer. Unlike batch normalization, which normalizes inputs using statistics computed over a mini-batch, layer normalization computes its normalization statistics over all of the summed inputs to the neurons in a layer for a single training case. This makes it particularly useful for recurrent neural networks (RNNs), because it removes the need to maintain separate statistics for each time step and therefore handles variable-length sequences naturally. Layer normalization is shown to stabilize the hidden-state dynamics of RNNs and to substantially reduce training time compared with previously published techniques. The paper also provides a theoretical analysis comparing the invariance properties of layer normalization with those of batch normalization and weight normalization, showing that layer normalization is invariant to per-training-case feature shifting and scaling. Experimental results on tasks including image-sentence ranking, question answering, and handwriting sequence generation support the effectiveness of layer normalization.
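To make the per-training-case computation concrete, here is a minimal NumPy sketch of the layer normalization step: the mean and standard deviation are taken over the hidden units of one layer for a single case, then a learned per-unit gain and bias are applied. The function name layer_norm, the arguments gain, bias, and the eps constant are illustrative choices for this sketch, not code from the paper (eps is a common numerical-stability addition rather than part of the paper's equation).

```python
import numpy as np

def layer_norm(a, gain, bias, eps=1e-5):
    # a: summed inputs to the H neurons of one layer, for a single training case.
    # Statistics are computed over these hidden units, not over a mini-batch.
    mu = a.mean()                      # per-case mean of the layer's summed inputs
    sigma = a.std()                    # per-case standard deviation over the same units
    a_hat = (a - mu) / (sigma + eps)   # normalize each unit's summed input
    return gain * a_hat + bias         # learned per-unit gain and bias restore expressiveness

# Example usage (illustrative values):
H = 4
rng = np.random.default_rng(0)
a = rng.normal(size=H)   # summed inputs for one training case
g = np.ones(H)           # gain, typically initialized to 1
b = np.zeros(H)          # bias, typically initialized to 0
print(layer_norm(a, g, b))
```

Because the statistics depend only on the current case, the same computation applies unchanged at every time step of an RNN and for any sequence length, which is the property the paper exploits.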