TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning


29 Dec 2017 | Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, Hai Li
TernGrad reduces communication costs in distributed deep learning by quantizing gradients to three numerical levels {-1, 0, 1}, which greatly shrinks the data exchanged during synchronization. The method is mathematically proven to converge under a gradient bound, and two techniques, layer-wise ternarizing and gradient clipping, are introduced to improve convergence. A scaler-sharing scheme further reduces communication, and TernGrad remains compatible with asynchronous SGD and other common training settings.

Implemented in TensorFlow and validated on multiple datasets and network architectures, TernGrad incurs no accuracy loss on AlexNet, and in some cases even improves accuracy, while losing less than 2% accuracy on GoogLeNet. A performance model built to study scalability shows that TernGrad achieves significant training-throughput gains across a range of deep neural networks, especially those with high communication-to-computation ratios, making it efficient and effective for large-scale distributed training.
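To make the quantization step concrete, here is a minimal NumPy sketch of the stochastic ternarization described above: each gradient element is kept (with its sign) with probability proportional to its magnitude, scaled by the per-tensor maximum, so the ternary gradient is an unbiased estimate of the original. The function name `ternarize` and the use of NumPy are illustrative assumptions, not the paper's TensorFlow implementation.

```python
import numpy as np

def ternarize(grad, rng=None):
    """Stochastically quantize a gradient tensor to {-s, 0, +s}.

    Sketch of TernGrad-style ternarization with a layer-wise scaler:
    s is the maximum absolute gradient value, and each element survives
    with probability |g_i| / s, keeping the expectation equal to grad.
    """
    rng = rng or np.random.default_rng()
    s = np.max(np.abs(grad))              # per-tensor (layer-wise) scaler
    if s == 0:
        return grad                       # all-zero gradient, nothing to quantize
    prob = np.abs(grad) / s               # keep-probability for each element
    mask = rng.random(grad.shape) < prob  # Bernoulli draw per element
    return s * np.sign(grad) * mask       # ternary values scaled by s

# Example: only the scaler s and the ternary signs need to be communicated,
# so each element fits in ~2 bits instead of a 32-bit float.
g = np.array([0.02, -0.5, 0.0, 0.31])
print(ternarize(g))
```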