ON THE CONVERGENCE OF FEDAVG ON NON-IID DATA

25 Jun 2020 | Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, Zhihua Zhang
This paper analyzes the convergence of Federated Averaging (FedAvg), a leading federated learning algorithm, on non-iid data. FedAvg runs Stochastic Gradient Descent (SGD) in parallel on a subset of devices and averages the resulting models periodically. The authors establish a convergence rate of \(\mathcal{O}(\frac{1}{T})\) for strongly convex and smooth problems, where \(T\) is the number of SGD iterations. They demonstrate a trade-off between communication efficiency and convergence rate, showing that low device participation need not significantly slow down learning. The analysis also shows that data heterogeneity slows convergence, in line with empirical observations. Additionally, the paper establishes a necessary condition for FedAvg on non-iid data: the learning rate must decay, even with full-gradient descent, otherwise the iterates can remain \(\Omega(\eta)\) away from the optimum. Theoretical guarantees are provided for two device sampling and averaging schemes, and numerical experiments verify the theoretical findings.
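To make the update pattern concrete, below is a minimal sketch of FedAvg-style training on synthetic non-iid data. It is illustrative only: it assumes simple per-device least-squares objectives, uniform device sampling with equal-weight averaging (a simplification of the paper's two sampling-and-averaging schemes), and an arbitrary decaying step-size schedule; names such as `local_sgd`, `local_steps`, and `devices_per_round` are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative non-iid setup: device k holds its own least-squares problem
# F_k(w) = (1 / (2 n_k)) * ||A_k w - b_k||^2; the global objective is their average.
num_devices, dim, n_k = 10, 5, 50
A = [rng.normal(size=(n_k, dim)) for _ in range(num_devices)]
# Each device has its own ground-truth weights, so the local optima differ (non-iid data).
b = [A_k @ rng.normal(size=dim) + rng.normal(scale=0.1, size=n_k) for A_k in A]

def local_sgd(w, A_k, b_k, steps, lr):
    """Run `steps` single-sample SGD updates on device k's local objective."""
    w = w.copy()
    for _ in range(steps):
        i = rng.integers(len(b_k))
        grad = (A_k[i] @ w - b_k[i]) * A_k[i]   # stochastic gradient of F_k at w
        w -= lr * grad
    return w

w = np.zeros(dim)
rounds, local_steps, devices_per_round = 200, 5, 4
for t in range(rounds):
    lr = 1.0 / (1.0 + 0.1 * t * local_steps)    # decaying step size, as the analysis requires
    sampled = rng.choice(num_devices, size=devices_per_round, replace=False)
    # Each sampled device runs E local SGD steps from the current global model;
    # the server then averages the returned models with equal weights.
    w = np.mean([local_sgd(w, A[k], b[k], local_steps, lr) for k in sampled], axis=0)
```

The decaying learning rate in the loop reflects the paper's necessary condition: with a fixed step size and more than one local step per round, the averaged iterate can stall \(\Omega(\eta)\) away from the optimum.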