DEEP GRADIENT COMPRESSION: REDUCING THE COMMUNICATION BANDWIDTH FOR DISTRIBUTED TRAINING

23 Jun 2020 | Yujun Lin, Song Han, Huizi Mao, Yu Wang, William J. Dally
Deep Gradient Compression (DGC) reduces the communication bandwidth of distributed training by sparsifying gradients: only the most significant gradient values are exchanged, while the rest are accumulated locally. DGC achieves compression ratios of 270× to 600× without loss of accuracy, shrinking the gradient exchange for ResNet-50 from 97 MB to 0.35 MB and for DeepSpeech from 488 MB to 0.74 MB.

To preserve accuracy under such aggressive sparsification, DGC combines four techniques: momentum correction, local gradient clipping, momentum factor masking, and warm-up training. Momentum correction applies momentum to the locally accumulated gradients so that sparse updates remain consistent with momentum SGD; local gradient clipping prevents gradient explosion when gradients are accumulated over many iterations; momentum factor masking keeps stale momentum from disturbing parameters that have just been updated; and warm-up training gradually increases sparsity so the model can adapt to sparse gradients early in training. A minimal sketch of one worker's update step is given below.

DGC is effective across tasks including image classification, language modeling, and speech recognition, and it enables large-scale distributed training over inexpensive commodity networks. By cutting communication overhead it improves scalability and makes distributed training practical on mobile devices and in low-bandwidth environments.
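To make the pipeline concrete, here is a minimal NumPy sketch of a single worker's DGC step covering momentum correction, local gradient clipping, top-k sparsification, and momentum factor masking. The function name `dgc_step`, the fixed `clip_norm`, and the 99.9% sparsity default are illustrative assumptions, not the authors' implementation; warm-up training (gradually ramping the sparsity over the first epochs) is omitted for brevity.

```python
import numpy as np

def dgc_step(gradient, velocity, accumulator,
             momentum=0.9, sparsity=0.999, clip_norm=None):
    """One local DGC step on a single worker (hypothetical helper, not the paper's code).

    Returns the sparse update to communicate plus the updated local state.
    """
    # Local gradient clipping before accumulation (the paper scales the clip
    # threshold per worker; here a fixed norm is used for illustration).
    if clip_norm is not None:
        norm = np.linalg.norm(gradient)
        if norm > clip_norm:
            gradient = gradient * (clip_norm / norm)

    # Momentum correction: apply momentum locally, then accumulate the
    # *velocity* rather than the raw gradient.
    velocity = momentum * velocity + gradient
    accumulator = accumulator + velocity

    # Top-k sparsification: send only the largest-magnitude accumulated values.
    k = max(1, int(accumulator.size * (1.0 - sparsity)))
    threshold = np.partition(np.abs(accumulator), -k)[-k]
    mask = np.abs(accumulator) >= threshold
    sparse_update = np.where(mask, accumulator, 0.0)

    # Momentum factor masking: clear both the accumulator and the velocity at
    # the positions that were just sent, so stale momentum does not carry over.
    accumulator = np.where(mask, 0.0, accumulator)
    velocity = np.where(mask, 0.0, velocity)

    return sparse_update, velocity, accumulator
```

In a full training loop, each worker would call a step like this on its local gradient, all-reduce only the sparse updates, and apply the aggregated result to the model; during the first few epochs the sparsity would be increased gradually (warm-up) before settling at its final value.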