Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

2 Mar 2015 | Sergey Ioffe, Christian Szegedy
Batch Normalization is a technique that accelerates the training of deep neural networks by reducing internal covariate shift: the change in the distribution of each layer's inputs during training, which complicates optimization and requires careful initialization and lower learning rates. Batch Normalization addresses this by normalizing layer inputs and making the normalization part of the model architecture. This allows much higher learning rates, reduces the need for careful initialization, and also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization matches the original accuracy with 14 times fewer training steps and ultimately surpasses the original model. An ensemble of batch-normalized networks improves on the best published result on ImageNet classification, reaching 4.9% top-5 validation error (4.8% test error), exceeding the accuracy of human raters.

The method normalizes layer inputs using mini-batch statistics, giving each activation zero mean and unit variance. The normalization is applied independently to each activation, keeping its input distribution stable, and it is differentiable, so it does not require analyzing the full training set after every parameter update. For each normalized activation, a learned scale and shift are trained alongside the rest of the model, ensuring the network can still represent the original activations and preserving its capacity. During training, gradients are backpropagated through the normalization itself.
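As a rough illustration of the training-time transform described above, here is a minimal NumPy sketch (the function and variable names are illustrative, not taken from the paper or any library): each feature is normalized with the statistics of the current mini-batch, then a learned per-feature scale (gamma) and shift (beta) restore the layer's ability to represent the original activations.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch-normalize a mini-batch x of shape (batch_size, num_features).

    gamma and beta are the learned per-feature scale and shift parameters;
    eps guards against division by zero.
    """
    mu = x.mean(axis=0)                      # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance per feature
    y = gamma * x_hat + beta                 # learned scale and shift preserve capacity
    return y, mu, var
```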
At inference time, population statistics replace the mini-batch statistics so that the output is deterministic. Batch Normalization enables higher learning rates by stabilizing input distributions and reducing the influence of parameter scale on gradient propagation, and it regularizes the model, reducing the need for Dropout. Experiments show that it accelerates training, improves accuracy, and allows the use of saturating nonlinearities. Applied to the Inception network, Batch Normalization reaches higher accuracy with far fewer training steps, and an ensemble of batch-normalized networks further improves the ImageNet results, surpassing previously published ones. The method works across a range of architectures, including convolutional networks, and has potential applications in recurrent neural networks and domain adaptation.
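To make the inference behaviour concrete, here is a sketch in the same style (reusing the NumPy import above; the population moments are approximated here with exponential moving averages of the mini-batch statistics, a common practical stand-in for the batch averaging described in the paper). The fixed estimates replace the mini-batch statistics, so the output for a given input no longer depends on the other examples in the batch.

```python
def update_running_stats(running_mu, running_var, mu, var, momentum=0.9):
    """Track estimates of the population mean/variance as exponential moving
    averages of the mini-batch statistics seen during training."""
    running_mu = momentum * running_mu + (1.0 - momentum) * mu
    running_var = momentum * running_var + (1.0 - momentum) * var
    return running_mu, running_var

def batch_norm_infer(x, gamma, beta, running_mu, running_var, eps=1e-5):
    """Normalize with the fixed population estimates so the output is
    deterministic and independent of the mini-batch composition."""
    x_hat = (x - running_mu) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```

In a full layer, update_running_stats would be called once per training step with the mu and var returned by batch_norm_train, and batch_norm_infer would be used once training is finished.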