DEEP GRADIENT COMPRESSION: REDUCING THE COMMUNICATION BANDWIDTH FOR DISTRIBUTED TRAINING

23 Jun 2020 | Yujun Lin, Song Han, Huizi Mao, Yu Wang, William J. Dally
Deep Gradient Compression (DGC) reduces the communication bandwidth of distributed training by sparsifying gradients: only the most significant gradient values are exchanged, while the rest are accumulated locally. DGC achieves compression ratios of 270× to 600× without loss of accuracy, shrinking the gradient exchange for ResNet-50 from 97 MB to 0.35 MB and for DeepSpeech from 488 MB to 0.74 MB.

To preserve accuracy under such aggressive sparsification, DGC combines four techniques: momentum correction, local gradient clipping, momentum factor masking, and warm-up training. Momentum correction applies momentum to the locally accumulated gradients so that sparse updates remain consistent with momentum SGD; local gradient clipping prevents gradient explosion when gradients are accumulated over many iterations; momentum factor masking keeps stale momentum from disturbing parameters that have just been updated; and warm-up training gradually increases sparsity so the model can adapt to sparse gradients early in training. A minimal sketch of one worker's update step is given below.

DGC is effective across tasks including image classification, language modeling, and speech recognition, and it enables large-scale distributed training over inexpensive commodity networks. By cutting communication overhead it improves scalability and makes distributed training practical on mobile devices and in low-bandwidth environments.
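To make the pipeline concrete, here is a minimal NumPy sketch of a single worker's DGC step covering momentum correction, local gradient clipping, top-k sparsification, and momentum factor masking. The function name `dgc_step`, the fixed `clip_norm`, and the 99.9% sparsity default are illustrative assumptions, not the authors' implementation; warm-up training (gradually ramping the sparsity over the first epochs) is omitted for brevity.

```python
import numpy as np

def dgc_step(gradient, velocity, accumulator,
             momentum=0.9, sparsity=0.999, clip_norm=None):
    """One local DGC step on a single worker (hypothetical helper, not the paper's code).

    Returns the sparse update to communicate plus the updated local state.
    """
    # Local gradient clipping before accumulation (the paper scales the clip
    # threshold per worker; here a fixed norm is used for illustration).
    if clip_norm is not None:
        norm = np.linalg.norm(gradient)
        if norm > clip_norm:
            gradient = gradient * (clip_norm / norm)

    # Momentum correction: apply momentum locally, then accumulate the
    # *velocity* rather than the raw gradient.
    velocity = momentum * velocity + gradient
    accumulator = accumulator + velocity

    # Top-k sparsification: send only the largest-magnitude accumulated values.
    k = max(1, int(accumulator.size * (1.0 - sparsity)))
    threshold = np.partition(np.abs(accumulator), -k)[-k]
    mask = np.abs(accumulator) >= threshold
    sparse_update = np.where(mask, accumulator, 0.0)

    # Momentum factor masking: clear both the accumulator and the velocity at
    # the positions that were just sent, so stale momentum does not carry over.
    accumulator = np.where(mask, 0.0, accumulator)
    velocity = np.where(mask, 0.0, velocity)

    return sparse_update, velocity, accumulator
```

In a full training loop, each worker would call a step like this on its local gradient, all-reduce only the sparse updates, and apply the aggregated result to the model; during the first few epochs the sparsity would be increased gradually (warm-up) before settling at its final value.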