Federated Learning on Non-IID Data Silos: An Experimental Study

28 Oct 2021 | Qinbin Li*, Yiqun Diao*, Quan Chen, Bingsheng He
This paper addresses the challenge of training machine learning models over distributed databases, where privacy concerns and data regulations leave data fragmented into multiple "data silos". Federated learning (FL) has emerged as a way to train models collaboratively without exchanging raw data. A key challenge in FL is the heterogeneity of data distributions among parties, commonly described as data that is not independent and identically distributed (non-IID). Previous studies have used rigid data partitioning strategies that are neither representative nor thorough.

To address this, the authors propose six non-IID data partitioning strategies that cover typical non-IID cases, including label distribution skew, feature distribution skew, and quantity skew. Using these strategies, they run extensive experiments on nine datasets to evaluate four state-of-the-art FL algorithms: FedAvg, FedProx, SCAFFOLD, and FedNova.

The results show that non-IID data significantly degrades the accuracy of FL algorithms and that no single algorithm consistently outperforms the others. The effectiveness of FL depends heavily on the type of data skew, with label distribution skew being more challenging than quantity skew. The authors also observe instability in the learning process caused by techniques such as batch normalization and sampling only a subset of parties per round. The main contributions of the paper are identifying non-IID data as a key challenge in FL and developing a benchmark, NIID-Bench, built on the six partitioning strategies, for researchers to study FL on non-IID data. The authors also provide insights and future directions for data management and learning in distributed data silos.
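To make the partitioning idea concrete, the following is a minimal sketch of how label distribution skew and quantity skew could be simulated by splitting sample indices across parties according to Dirichlet-distributed proportions. It is an illustration under stated assumptions, not the NIID-Bench implementation: the function names, the `beta` concentration parameter, and the use of Python's standard `random` module are all choices made here for the example.

```python
import random
from collections import defaultdict


def dirichlet(rng, beta, k):
    """Draw one sample from a symmetric Dirichlet(beta) over k entries
    by normalizing independent Gamma(beta, 1) draws."""
    draws = [rng.gammavariate(beta, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def split_by_proportions(items, proportions):
    """Cut `items` into consecutive chunks whose sizes follow `proportions`.
    Every item lands in exactly one chunk."""
    chunks, start, cumulative = [], 0, 0.0
    n = len(items)
    for i, p in enumerate(proportions):
        cumulative += p
        # The last chunk always takes whatever remains.
        end = n if i == len(proportions) - 1 else min(n, round(cumulative * n))
        end = max(end, start)
        chunks.append(items[start:end])
        start = end
    return chunks


def label_skew_partition(labels, num_parties, beta, seed=0):
    """Hypothetical helper: label distribution skew.
    Each class is spread over the parties by its own Dirichlet draw,
    so parties end up with different label proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    parties = [[] for _ in range(num_parties)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        shares = dirichlet(rng, beta, num_parties)
        for party, chunk in zip(parties, split_by_proportions(idxs, shares)):
            party.extend(chunk)
    return parties


def quantity_skew_partition(num_samples, num_parties, beta, seed=0):
    """Hypothetical helper: quantity skew.
    Samples are shuffled (labels stay roughly balanced) and only the
    number of samples per party is skewed by a Dirichlet draw."""
    rng = random.Random(seed)
    idxs = list(range(num_samples))
    rng.shuffle(idxs)
    return split_by_proportions(idxs, dirichlet(rng, beta, num_parties))


if __name__ == "__main__":
    labels = [i % 10 for i in range(1000)]  # toy dataset with 10 balanced classes
    for name, parts in [
        ("label skew", label_skew_partition(labels, 5, beta=0.5)),
        ("quantity skew", quantity_skew_partition(len(labels), 5, beta=0.5)),
    ]:
        print(name, [len(p) for p in parts])
```

In this sketch a smaller `beta` concentrates each Dirichlet draw on fewer parties and so produces a more skewed split, while a large `beta` approaches an even split.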
The paper concludes with a discussion on the trade-offs between accuracy and communication efficiency, the introduction of new training factors, and the importance of addressing mixed types of skew in future research.