This paper analyzes the convergence of Federated Averaging (FedAvg) in non-iid data settings. FedAvg is a widely used federated learning algorithm in which many devices collaboratively train a model without sharing their raw data: in each round, a subset of devices runs local stochastic gradient descent (SGD), and a server periodically averages the resulting models. Despite its popularity, the algorithm's convergence behavior on non-iid data had not been well understood.
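The local-SGD-then-average loop described above can be sketched as follows. This is a minimal illustration on a toy least-squares problem, not the paper's setup: the device data, shift values, step counts, and learning rate are all hypothetical choices made here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_sgd(w, X, y, lr=0.05, steps=10, batch=8):
    """Run `steps` of mini-batch SGD on one device's local data."""
    w = w.copy()
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch
        w -= lr * grad
    return w

# Three devices with heterogeneous (non-iid) local optima: each
# device's labels come from a different underlying parameter vector.
devices = []
for shift in (1.0, 2.0, 3.0):
    X = rng.normal(size=(100, 5))
    y = X @ (np.ones(5) * shift) + 0.1 * rng.normal(size=100)
    devices.append((X, y))

def global_loss(w):
    return np.mean([np.mean((X @ w - y) ** 2) for X, y in devices])

w = np.zeros(5)
loss0 = global_loss(w)
for _ in range(50):                               # communication rounds
    local_models = [local_sgd(w, X, y) for X, y in devices]
    w = np.mean(local_models, axis=0)             # server averages models
loss_final = global_loss(w)                       # well below loss0
```

Averaging only every 10 local steps (rather than after each step) is what makes the scheme communication-efficient, at the cost of letting the devices' heterogeneous objectives pull the local models apart between rounds.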
The authors establish a convergence rate of $ \mathcal{O}(1/T) $ for strongly convex, smooth problems, where $T$ is the total number of SGD steps; that is, FedAvg converges to the global optimum at a rate inversely proportional to the number of iterations. The analysis exposes a trade-off between communication efficiency and convergence rate. It also shows that partial device participation, in which only some devices take part in each round, does not significantly slow the learning process.
The paper further demonstrates that data heterogeneity slows the convergence of FedAvg, which aligns with empirical observations. The authors also establish a necessary condition for FedAvg on non-iid data: the learning rate must decay, even when full gradients are used. With a fixed learning rate $\eta$, FedAvg can converge to suboptimal solutions that are $ \Omega(\eta) $ away from the optimum.
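The fixed-learning-rate bias is easy to reproduce on a toy problem. The sketch below (my own construction, not the paper's example) runs FedAvg with full local gradients on two scalar quadratics $f_i(x) = c_i (x - a_i)^2 / 2$ with heterogeneous curvatures; with $E$ local steps per round and a fixed $\eta$, the iterates settle at a point whose distance from the global optimum shrinks roughly in proportion to $\eta$.

```python
import numpy as np

# Two devices with quadratic objectives f_i(x) = c_i * (x - a_i)^2 / 2.
# The values of a, c, and E are hypothetical, chosen for illustration.
a = np.array([0.0, 1.0])                  # local optima
c = np.array([1.0, 3.0])                  # heterogeneous curvatures
x_star = float(np.sum(c * a) / np.sum(c)) # global optimum = 0.75
E = 5                                     # local steps per round

def run_fedavg(eta, rounds=2000):
    """FedAvg with E full-gradient local steps and a FIXED step size."""
    x = 0.0
    for _ in range(rounds):
        local = []
        for ai, ci in zip(a, c):
            xi = x
            for _ in range(E):
                xi -= eta * ci * (xi - ai)   # exact local gradient step
            local.append(xi)
        x = float(np.mean(local))            # server averages
    return x

gap_large = abs(run_fedavg(0.10) - x_star)   # stays bounded away from 0
gap_small = abs(run_fedavg(0.01) - x_star)   # roughly 10x smaller
```

Even with exact gradients and many rounds, the gap does not vanish; shrinking $\eta$ tenfold shrinks the residual gap by roughly the same factor, consistent with the $\Omega(\eta)$ lower bound.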
The study also compares FedAvg with other algorithms and shows that its convergence rate depends on the choice of sampling and averaging schemes. The authors propose a new scheme that improves the convergence rate in non-iid settings. Numerical experiments validate the theoretical results and confirm the effectiveness of the proposed methods, underscoring the importance of appropriate sampling and averaging schemes for efficient convergence in federated learning.
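One property a sampling-and-averaging scheme must preserve is unbiasedness: the expected aggregate over sampled devices should equal the full-participation weighted average. The sketch below checks this numerically for one natural scheme, sampling devices with replacement in proportion to their data-size weights and then averaging the sampled models uniformly. The scalar models and weights are illustrative values, not the paper's experiments, and this is not necessarily the exact scheme the authors propose.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical local models (scalars) and data-size weights p_k.
models = np.array([0.2, 0.5, 0.9, 1.4])
p = np.array([0.1, 0.2, 0.3, 0.4])          # proportional to local data size
target = float(np.sum(p * models))          # full-participation average

# Sample K devices WITH replacement, each draw with probability p_k,
# then average the sampled models uniformly.
K, trials = 2, 200_000
idx = rng.choice(len(models), size=(trials, K), replace=True, p=p)
est = models[idx].mean(axis=1)

empirical = float(est.mean())               # concentrates near `target`
```

Each individual draw has expectation $\sum_k p_k w_k$, so the uniform average of $K$ such draws matches the weighted full average in expectation, which is what lets partial participation proceed without systematic drift.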