15 Feb 2018 | Sharan Narang, Gregory Diamos, Erich Elsen, Paulius Micikevicius, Jonah Alben, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu
Mixed precision training uses half-precision (FP16) floating-point numbers to train deep neural networks without losing accuracy or requiring changes to hyperparameters. This approach nearly halves memory usage and speeds up arithmetic on recent GPUs. Weights, activations, and gradients are stored in FP16, but three techniques are used to prevent loss of critical information: maintaining a FP32 master copy of weights, loss-scaling to preserve small gradient values, and using FP16 arithmetic with accumulation in FP32. This methodology works across various tasks and large-scale models with over 100 million parameters, trained on large datasets.
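To make the underflow problem concrete: FP16 cannot represent magnitudes much below 2^-24 (about 6e-8), so very small gradient values round to zero unless they are scaled up before the backward pass. A minimal NumPy sketch with made-up gradient magnitudes (not values from the paper):

```python
import numpy as np

# Illustrative gradient magnitudes (made up); FP16 flushes values below ~2**-24 (~6e-8) to zero.
grads_fp32 = np.array([1e-3, 1e-6, 1e-8, 1e-9], dtype=np.float32)
print(grads_fp32.astype(np.float16))            # the two smallest values become 0.0 in FP16

# Scaling the loss (and therefore the gradients) by a constant shifts small values back
# into FP16's representable range; they are unscaled again before the weight update.
scale = np.float32(1024.0)
scaled_fp16 = (grads_fp32 * scale).astype(np.float16)
print(scaled_fp16.astype(np.float32) / scale)   # all four values survive, up to rounding error
```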
The paper introduces three key techniques for mixed precision training: (1) maintaining a FP32 master copy of weights for updates, (2) loss-scaling to preserve small gradient values, and (3) using FP16 arithmetic with accumulation in FP32. These techniques allow training of a wide variety of network architectures and applications, including image classification, image generation, object detection, language modeling, machine translation, and speech recognition, without accuracy loss compared to FP32 training.
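As a rough illustration of how these techniques fit together, here is a minimal sketch of one mixed precision training step, assuming PyTorch, a toy linear model, and random data (the model, loss, and fixed scale factor of 1024 are illustrative choices, not the paper's exact setup):

```python
import torch

torch.manual_seed(0)
master_w = torch.randn(16, 1, requires_grad=True)   # (1) FP32 master copy of the weights
optimizer = torch.optim.SGD([master_w], lr=0.01)
loss_scale = 1024.0                                  # (2) fixed loss-scaling factor (illustrative)

x, y = torch.randn(32, 16), torch.randn(32, 1)       # toy data

for step in range(10):
    w16 = master_w.half()                            # FP16 copy used for forward and backward
    # (3) FP16 arithmetic for the products, with the reduction accumulated in FP32
    out = (x.half() * w16.t()).float().sum(dim=1, keepdim=True)
    loss = ((out - y) ** 2).mean()                   # loss kept in FP32

    optimizer.zero_grad()
    (loss * loss_scale).backward()                   # scale the loss so small FP16 gradients survive
    master_w.grad.div_(loss_scale)                   # unscale before the weight update
    optimizer.step()                                 # update applied to the FP32 master weights
```

The scale here is a fixed constant; in practice it is chosen, or adapted during training, so that the scaled gradients stay within FP16 range without overflowing.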
Experiments show that mixed precision training works for CNNs, detection networks, speech recognition, machine translation, and language modeling. For ILSVRC image classification, mixed precision training matched the top-1 accuracy of FP32 training. Detection networks such as Faster-RCNN and Multibox-SSD matched or exceeded FP32 accuracy once loss-scaling was applied. Speech recognition models such as DeepSpeech 2 reached error rates comparable to, and in some cases slightly better than, their FP32 baselines. Machine translation models trained on the WMT15 dataset likewise achieved results comparable to FP32 training.
The paper concludes that mixed precision training is an important technique that reduces memory consumption and speeds up training and inference. It allows many different deep learning models to be trained without accuracy loss, and introduces loss-scaling to preserve the large numbers of small-magnitude gradient values these models produce. Future work includes further optimizations for mixed precision training and extending the technique to generative models and deep reinforcement learning applications.