FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models

7 Jun 2024 | Rui Ye, Rui Ge, Xinyu Zhu, Jingyi Chai, Yaxin Du, Yang Liu, Yanfeng Wang, Siheng Chen
FedLLM-Bench is a comprehensive benchmark for federated learning of large language models (FedLLM), offering realistic datasets and evaluation metrics to facilitate research in this area. It integrates 8 training methods, 4 training datasets, and 6 evaluation metrics: three datasets for federated instruction tuning and one for federated preference alignment, with client counts ranging from 38 to 747. The datasets capture real-world properties such as language, quality, quantity, instruction style, sequence length, embedding, and preference, mirroring the complexities and diversities of real-world scenarios across tasks and scales. By providing a practical, ready-made testbed, FedLLM-Bench reduces the effort required of the FedLLM community and promotes fair comparisons. The datasets are available at https://github.com/rui-ye/FedLLM-Bench.

Based on these datasets, the authors benchmark existing federated learning (FL) methods and provide empirical insights; thanks to its flexibility and diversity, the benchmark also supports exploration of new research directions such as multilingual instruction tuning and preference alignment. Experiments demonstrate that federated learning enhances LLM performance, and an evaluation of differential privacy in FedLLM shows that FedAvg with differential privacy achieves performance comparable to FedAvg without it.
FedLLM-Bench is the first realistic benchmark for FedLLM, providing a comprehensive testbed for the community.
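The summary above mentions both FedAvg-style training and its differentially private variant. The following is a minimal sketch of how the two combine: clip each client's update to a fixed L2 norm, optionally add Gaussian noise, then average the updates weighted by local data quantity. All function names, parameter values, and data here are illustrative assumptions; they are not drawn from the FedLLM-Bench codebase.

```python
# Hedged sketch of FedAvg with client-side differential privacy:
# clip each client update, add Gaussian noise, average by data size.
# Everything below is illustrative, not the benchmark's actual code.
import math
import random


def sanitize(update, clip_norm=1.0, sigma=0.0, rng=None):
    """Clip an update to L2 norm clip_norm, then add Gaussian noise."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale + rng.gauss(0.0, sigma * clip_norm) for x in update]


def fedavg(client_updates, clip_norm=1.0, sigma=0.0):
    """Aggregate (update, num_samples) pairs into one server update,
    weighting each sanitized update by its client's dataset size."""
    total = sum(n for _, n in client_updates)
    sanitized = [(sanitize(u, clip_norm, sigma), n) for u, n in client_updates]
    dim = len(client_updates[0][0])
    return [sum(u[i] * n / total for u, n in sanitized) for i in range(dim)]


# Three clients with unequal data quantities (the benchmark's datasets
# span 38 to 747 clients; three are shown for brevity). The third
# client's update exceeds the clipping norm and is scaled down.
# With sigma=0 this reduces to plain FedAvg over clipped updates.
updates = [([1.0, 0.0], 10), ([0.0, 1.0], 30), ([3.0, 4.0], 40)]
print(fedavg(updates))  # approximately [0.425, 0.775]
```

Setting sigma above zero trades accuracy for privacy; the paper's reported result, that DP-enabled FedAvg stays comparable to plain FedAvg, suggests this trade-off can be mild in practice.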