11 Sep 2017 | Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu
This paper investigates whether decentralized algorithms can outperform centralized algorithms in parallel stochastic gradient descent (PSGD). The authors propose a decentralized parallel stochastic gradient descent (D-PSGD) algorithm and provide theoretical analysis showing that it achieves computational complexity comparable to centralized PSGD (C-PSGD) with significantly lower communication cost. D-PSGD avoids the communication bottleneck of centralized algorithms by letting each node exchange information only with its neighbors, rather than with a central node. The analysis further shows that D-PSGD achieves linear speedup in computational complexity with respect to the number of nodes, making it more efficient in scenarios with high communication overhead.
Empirical experiments across multiple frameworks (CNTK and Torch), different network configurations, and up to 112 GPUs show that D-PSGD can be up to 10 times faster than well-optimized centralized counterparts, especially in networks with low bandwidth or high latency. The study demonstrates that decentralized algorithms can outperform centralized ones in certain scenarios, offering a promising direction for future research in distributed machine learning systems.
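To make the neighbor-only communication pattern concrete, the following is a minimal NumPy simulation of the kind of decentralized update D-PSGD performs: each node averages its model with its ring neighbors using a doubly stochastic mixing matrix, then takes a local stochastic gradient step. The ring topology, toy quadratic losses, step size, and noise level are illustrative assumptions, not details from the paper.

```python
import numpy as np

def ring_mixing_matrix(n):
    # Doubly stochastic mixing matrix for a ring topology: each node
    # averages itself with its two neighbors, weight 1/3 each.
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W

def decentralized_sgd_step(X, W, grads, lr):
    # One decentralized step: neighbor averaging (W @ X) followed by
    # each node's local stochastic gradient update.
    return W @ X - lr * grads

# Toy problem (assumed for illustration): node i holds the local loss
# f_i(x) = 0.5 * (x - t_i)^2, so the global optimum is mean(t_i).
rng = np.random.default_rng(0)
n = 8
targets = rng.normal(size=(n, 1))
X = rng.normal(size=(n, 1))          # row i = node i's model copy
W = ring_mixing_matrix(n)

for _ in range(300):
    noise = 0.01 * rng.normal(size=(n, 1))   # stochastic-gradient noise
    grads = (X - targets) + noise
    X = decentralized_sgd_step(X, W, grads, lr=0.1)

# Distance from the averaged model to the global optimum shrinks,
# even though no node ever talked to a central server.
print(float(np.abs(X.mean() - targets.mean())))
```

Note the design point the paper's communication argument rests on: the `W @ X` step only requires each node to receive its two neighbors' rows, so per-iteration traffic at any node is constant in `n`, whereas a central parameter server must aggregate all `n` models.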