BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions


26 Jun 2024 | Terry Yue Zhuo, Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Binyuan Hui, Niklas Muennighoff, David Lo, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro von Werra
BigCodeBench is a new benchmark for evaluating the code generation capabilities of large language models (LLMs), with a focus on diverse function calls and complex instructions. It comprises 1,140 programming tasks spanning 139 libraries and 7 domains, with an average of 5.6 test cases per task and an average branch coverage of 99%. The benchmark comes in two variants: BigCodeBench-Complete, which prompts models with structured docstrings, and BigCodeBench-Instruct, which uses natural-language instructions. The benchmark was constructed through collaboration between human experts and LLMs, combining data synthesis, program refactoring, and human curation.

An evaluation of 60 LLMs shows that they struggle with complex instructions and compositional function calls: the best models score up to 60% on BigCodeBench-Complete and below 50% on BigCodeBench-Instruct, far short of the 97% achieved by humans. The results indicate that LLMs still fall short of human expectations when instructions are phrased more naturally, while the benchmark's strong correlation with existing benchmarks validates its evaluation results. Together, these findings highlight the need for further advances in the code generation capabilities of LLMs.
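To make the two prompt variants and the test-based scoring concrete, here is a minimal, illustrative sketch of a BigCodeBench-style task. The toy task, the prompt strings, and the single-sample pass check are assumptions for illustration only; the real benchmark ships its own tasks, hidden unit tests, and sandboxed evaluation harness.

```python
# Illustrative sketch of a BigCodeBench-style task (hypothetical toy example,
# not taken from the benchmark itself).
import unittest

# --- Complete variant: a structured docstring the model must finish ---
COMPLETE_PROMPT = '''\
import collections

def task_func(words):
    """
    Count how often each word appears in `words` and return the counts
    as a dict sorted by word.

    Requirements:
    - collections

    Example:
    >>> task_func(["a", "b", "a"])
    {'a': 2, 'b': 1}
    """
'''

# --- Instruct variant: the same task phrased as a natural-language instruction ---
INSTRUCT_PROMPT = (
    "Write a function task_func(words) that counts word occurrences with "
    "collections.Counter and returns a dict sorted by word."
)

# A candidate completion, as an LLM might return it for the Complete prompt.
CANDIDATE = COMPLETE_PROMPT + """\
    counts = collections.Counter(words)
    return dict(sorted(counts.items()))
"""


# --- Unit tests stand in for the benchmark's hidden test cases ---
class TestTask(unittest.TestCase):
    def setUp(self):
        namespace = {}
        exec(CANDIDATE, namespace)  # run the candidate in an isolated namespace
        self.task_func = namespace["task_func"]

    def test_counts(self):
        self.assertEqual(self.task_func(["a", "b", "a"]), {"a": 2, "b": 1})

    def test_empty(self):
        self.assertEqual(self.task_func([]), {})


if __name__ == "__main__":
    result = unittest.main(exit=False).result
    passed = result.wasSuccessful()
    # A single generated sample either passes all hidden tests or it does not.
    print(f"single-sample pass: {1.0 if passed else 0.0}")
```

In this sketch, a task is solved only if the completed function passes every unit test, which is the same all-or-nothing criterion reflected in the benchmark's pass-rate scores quoted above.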