2019 | Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Konečný, Stefano Mazzocchi, H. Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, Jason Roselander
This paper presents a scalable production system for Federated Learning (FL) on mobile devices, built using TensorFlow. The system enables training deep neural networks on decentralized data stored on phones, with weights combined in the cloud via Federated Averaging. The authors describe the high-level design, challenges, and solutions, including device availability, unreliable connectivity, and limited storage and compute resources. The system has been applied in large-scale applications, such as phone keyboards, and supports tens of millions of real-world devices. The paper also discusses the protocol, architecture, and tools for model engineering, as well as analytics for monitoring device health and performance. Secure Aggregation is introduced to enhance privacy, and the system's operational profile is provided, highlighting its scalability and performance. Finally, the paper outlines future work, including addressing bias, improving convergence time, optimizing device scheduling, reducing bandwidth usage, and expanding to Federated Computation.
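The core aggregation step referenced above, Federated Averaging, combines client model updates on the server as a weighted average, weighting each client by its number of local training examples. A minimal sketch of that step (function and variable names here are illustrative, not taken from the paper's TensorFlow implementation):

```python
def federated_average(client_updates):
    """Combine client model weights via Federated Averaging.

    client_updates: list of (num_examples, weights) pairs, where
    weights is a flat list of floats representing a client's model.
    Returns the example-weighted average of the client weights.
    """
    total_examples = sum(n for n, _ in client_updates)
    dim = len(client_updates[0][1])
    averaged = [0.0] * dim
    for n, weights in client_updates:
        share = n / total_examples  # this client's fraction of the data
        for i, w in enumerate(weights):
            averaged[i] += share * w
    return averaged

# Example: three clients holding different amounts of local data;
# the result is pulled toward the clients with more examples.
updates = [(10, [1.0, 2.0]), (30, [2.0, 0.0]), (60, [0.0, 1.0])]
print(federated_average(updates))
```

In the production system described in the paper, this averaging happens in the cloud over updates collected from a sampled round of devices, so that raw training data never leaves the phone.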