7 Aug 2018 | Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, Anima Anandkumar
SIGNSGD (sign-based Stochastic Gradient Descent) is a novel optimization method designed to reduce communication costs in distributed training of large neural networks. By transmitting only the sign of each minibatch stochastic gradient, SIGNSGD achieves both compressed gradients and a convergence rate comparable to standard Stochastic Gradient Descent (SGD). The effectiveness of SIGNSGD is demonstrated through theoretical analysis, which shows that it can outperform SGD in problems with a particular $\ell_1$ geometry: when gradients are as dense or denser than stochasticity and curvature. Empirically, SIGNSGD is shown to match the accuracy and convergence speed of Adam on deep ImageNet models. The paper also extends the theory to a distributed setting where the parameter server uses majority vote to aggregate gradient signs from workers, enabling 1-bit compression of worker-server communication in both directions. Using a theorem by Gauss (1823), the authors prove that majority vote can achieve the same reduction in variance as full-precision distributed SGD, highlighting the potential of sign-based optimization schemes for fast communication and convergence. The code for reproducing the experiments is available at <https://github.com/jxbz/signSGD>.
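For concreteness, the two update rules described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' released implementation; the learning rate `lr`, the `grad_fn` / `worker_grad_fns` callables, and all other names are assumptions made purely for illustration.

```python
import numpy as np

def sign_sgd_step(x, grad_fn, lr=1e-3):
    """One signSGD step: move against the sign of the stochastic gradient.

    (Illustrative sketch; grad_fn is an assumed callable returning a
    minibatch stochastic gradient of the same shape as x.)
    """
    g = grad_fn(x)                 # minibatch stochastic gradient
    return x - lr * np.sign(g)     # only the sign of each coordinate is used

def majority_vote_step(x, worker_grad_fns, lr=1e-3):
    """Distributed signSGD with majority vote (1-bit both directions).

    Each worker sends sign(g_m); the server broadcasts the elementwise
    majority sign, i.e. sign(sum_m sign(g_m)).
    """
    worker_signs = [np.sign(f(x)) for f in worker_grad_fns]  # 1-bit uplink per worker
    vote = np.sign(np.sum(worker_signs, axis=0))             # 1-bit downlink
    return x - lr * vote
```

In the distributed sketch, both the worker-to-server and server-to-worker messages reduce to one sign bit per coordinate, which is the compression property the abstract highlights.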