[slides] Local SGD Converges Fast and Communicates Little

Local Stochastic Gradient Descent (SGD) is a method that aims to reduce communication costs in large-scale distributed training by running SGD independently on different workers and averaging sequences infrequently. This paper provides a theoretical analysis of Local SGD, proving that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients, achieving linear speedup in the number of workers and mini-batch size. The number of communication rounds can be reduced by a factor of \( T^{1/2} \) compared to mini-batch SGD. The results are applicable to both synchronous and asynchronous implementations. The paper also discusses the practical implications and guidelines for using Local SGD, including the choice of synchronization intervals and mini-batch sizes. Experimental results on a logistic regression problem validate the theoretical findings, showing significant speedup in communication rounds.Local Stochastic Gradient Descent (SGD) is a method that aims to reduce communication costs in large-scale distributed training by running SGD independently on different workers and averaging sequences infrequently. This paper provides a theoretical analysis of Local SGD, proving that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients, achieving linear speedup in the number of workers and mini-batch size. The number of communication rounds can be reduced by a factor of \( T^{1/2} \) compared to mini-batch SGD. The results are applicable to both synchronous and asynchronous implementations. The paper also discusses the practical implications and guidelines for using Local SGD, including the choice of synchronization intervals and mini-batch sizes. Experimental results on a logistic regression problem validate the theoretical findings, showing significant speedup in communication rounds.

Local SGD Converges Fast and Communicates Little

3 May 2019 | Sebastian U. Stich