TOWARDS FEDERATED LEARNING AT SCALE: SYSTEM DESIGN

2019 | Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Konečný, Stefano Mazzocchi, H. Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, Jason Roselander
This paper presents a scalable production system for Federated Learning (FL) on mobile devices, built on TensorFlow. The system runs synchronous training rounds, a prerequisite for privacy technologies such as Secure Aggregation, which ensures that individual device updates remain private from the server. It is designed to handle FL populations of up to hundreds of millions of devices and can run large-batch SGD-style algorithms as well as the Federated Averaging algorithm.

At the core of the architecture is a communication protocol through which devices participate in FL rounds in three phases: selection, configuration, and reporting. The server infrastructure is built on the actor model, with actors handling message passing and the coordination of FL tasks. The server is designed to scale with both the number of participating devices and the size of model updates, and includes mechanisms for pipelining rounds and handling failure modes such as devices dropping out mid-round.

The system also provides tools and workflows that let model engineers define, test, and deploy FL tasks, with support for versioning, testing, and staged deployment. It has been applied in large-scale settings such as on-device phone keyboards, demonstrating that models can be trained on privacy-sensitive data without that data ever being transmitted to servers. Built-in analytics monitor device health and training performance and help detect and resolve issues during FL training.

The paper also discusses related work, including alternative approaches to FL and applications of FL in other domains. Overall, the system is designed to be flexible and scalable, supporting a wide range of FL tasks and applications.
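
As a concrete illustration of the training step the system executes each round, here is a minimal NumPy sketch of Federated Averaging: each selected device runs local SGD on its own data, and the server combines the resulting weights with an example-count-weighted average. Names such as `client_update` and `federated_average` are illustrative assumptions, not the paper's API.

```python
# Minimal Federated Averaging sketch (illustrative; not the paper's code).
# Each "device" runs local SGD on a linear model, and the server combines
# the returned weights with an example-count-weighted average.
import numpy as np

def client_update(weights, X, y, lr=0.1, epochs=1):
    """Local SGD on one device's data; returns (new_weights, num_examples)."""
    w = weights.copy()
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            grad = (xi @ w - yi) * xi  # squared-error gradient, linear model
            w -= lr * grad
    return w, len(X)

def federated_average(client_results):
    """Combine client weights, weighting each by its example count."""
    total = sum(n for _, n in client_results)
    return sum(n * w for w, n in client_results) / total

# Simulate three devices holding disjoint synthetic data.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(20, 2))
    clients.append((X, X @ true_w))

global_w = np.zeros(2)
for _ in range(10):  # ten synchronous rounds
    results = [client_update(global_w, X, y) for X, y in clients]
    global_w = federated_average(results)
print("learned weights:", global_w.round(3))  # converges toward true_w
```

In the production system the local computation is expressed as a TensorFlow plan rather than hand-written SGD, but the aggregation structure is the same.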
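
The three-phase round protocol can likewise be sketched as a simple server-side loop. This is a schematic reconstruction under assumed names (`RoundPhase`, `run_round`, `Device`); the production protocol adds details such as pace steering, timeouts, and quotas that are omitted here.

```python
# Schematic sketch of one round of the selection/configuration/reporting
# protocol. Class and function names are assumptions for illustration;
# the production system adds pace steering, timeouts, and quotas.
import enum
import random

class RoundPhase(enum.Enum):
    SELECTION = "selection"          # choose a subset of checked-in devices
    CONFIGURATION = "configuration"  # send the FL plan and model checkpoint
    REPORTING = "reporting"          # collect updates; aggregate or abandon

class Device:
    """Toy stand-in for a phone that has checked in with the server."""
    def __init__(self, device_id, dropout_prob=0.1):
        self.device_id = device_id
        self.dropout_prob = dropout_prob

    def configure(self, plan, checkpoint):
        self.plan, self.checkpoint = plan, checkpoint

    def report(self):
        # Devices may drop out mid-round (lost connectivity, left idle state).
        if random.random() < self.dropout_prob:
            return None
        return {"device": self.device_id, "update": "weight-delta"}

def run_round(checked_in, goal_count, min_count):
    print(RoundPhase.SELECTION.value)
    selected = random.sample(checked_in, min(goal_count, len(checked_in)))

    print(RoundPhase.CONFIGURATION.value)
    for d in selected:
        d.configure(plan="fedavg-plan", checkpoint="global-model")

    print(RoundPhase.REPORTING.value)
    reports = [r for r in (d.report() for d in selected) if r is not None]
    # The round only commits if enough devices reported back.
    return reports if len(reports) >= min_count else None

random.seed(1)
fleet = [Device(i) for i in range(100)]
result = run_round(fleet, goal_count=30, min_count=25)
print("round committed" if result else "round abandoned")
```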
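
Finally, the masking idea behind Secure Aggregation can be shown in a few lines: each pair of participating devices agrees on a random mask that one adds and the other subtracts, so individual uploads look random to the server while the masks cancel exactly in the sum. This is a deliberately simplified sketch; the actual protocol derives masks from shared secrets and uses secret sharing to survive drop-outs.

```python
# Simplified sketch of pairwise masking in Secure Aggregation. Each pair
# of clients (i, j) shares a random mask that i adds and j subtracts, so
# the server learns only the sum of updates, never an individual update.
# The real protocol handles drop-outs via secret sharing, omitted here.
import numpy as np

def masked_updates(updates, rng):
    masked = [u.astype(float).copy() for u in updates]
    n = len(updates)
    for i in range(n):
        for j in range(i + 1, n):
            mask = rng.normal(size=updates[0].shape)
            masked[i] += mask   # client i adds the shared mask
            masked[j] -= mask   # client j subtracts it; cancels in the sum
    return masked

rng = np.random.default_rng(42)
updates = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
masked = masked_updates(updates, rng)
print("server sees:", [m.round(2) for m in masked])  # individually random
print("sum of masked:", sum(masked).round(6))  # true sum [9, 12], up to rounding
```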