26 Jun 2024 | Terry Yue Zhuo, Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Binyuan Hui, Niklas Muennighoff, David Lo, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro von Werra
**BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions**
**Authors:** Terry Yue Zhuo and core contributors (random ordering; full author list above)
**Affiliation:** Monash University
**Contact Information:** https://bigcode-bench.github.io/, {terry.zhuo@monash.edu; contact@bigcode-project.org}
**Abstract:** This paper introduces BigCodeBench, a benchmark designed to evaluate the capabilities of Large Language Models (LLMs) in solving challenging and practical programming tasks. Unlike existing benchmarks that focus on short, self-contained algorithmic tasks, BigCodeBench emphasizes the use of diverse function calls and complex instructions. The benchmark comprises 1,140 fine-grained programming tasks spanning 139 libraries across 7 domains, with an average of 5.6 test cases per task and 99% branch coverage. Two variants, BigCodeBench-Complete and BigCodeBench-Instruct, assess LLMs' ability to complete code from structured docstrings and from concise natural language instructions, respectively. Extensive evaluations of 60 LLMs show that current models struggle with complex instructions and compositional function calls: the best model scores only 60%, far below the human performance of 97%. The results highlight the need for further advancements in LLMs to better handle real-world programming tasks.
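To make the task format concrete, the sketch below shows what a Complete-style prompt might look like: a function signature and structured docstring that asks the model to compose calls from multiple libraries. The function name, docstring wording, and library choices here are illustrative assumptions, not an actual BigCodeBench task; the Instruct variant would condense the same docstring into a short natural-language instruction.

```python
# Hypothetical Complete-style prompt (illustrative only, not a real BigCodeBench task).
# The model sees the imports, signature, and docstring, and must fill in a body
# that composes calls from several libraries (here: re, collections, pandas).
import re
from collections import Counter

import pandas as pd


def task_func(text: str) -> pd.DataFrame:
    """
    Count word frequencies in `text` (case-insensitive, punctuation stripped)
    and return a DataFrame with columns ['word', 'count'], sorted by 'count'
    in descending order.

    Parameters:
        text (str): Free-form input text.

    Returns:
        pd.DataFrame: One row per distinct word, most frequent first.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common()  # already sorted by count, descending
    return pd.DataFrame(counts, columns=["word", "count"])
```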
**Key Contributions:**
1. **Benchmark Construction:** BigCodeBench is constructed through a collaboration between LLMs and human experts, ensuring rigorous evaluation.
2. **Task Diversity:** The benchmark covers a wide range of function calls and domains, requiring compositional reasoning and complex instruction following.
3. **Evaluation Framework:** A detailed evaluation framework is provided to assess LLMs' performance on both task-solving and tool-use.
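As a rough illustration of how task-level grading can work, the sketch below runs a candidate solution against a unittest-based test suite and reports greedy Pass@1. It assumes each task's tests live in a class named `TestCases`, and it omits the sandboxing and resource limits a real harness would apply; it is a minimal sketch, not the authors' evaluation code.

```python
# Minimal grading sketch: execute candidate code plus its unittest suite in a
# shared namespace, then count the fraction of tasks whose single greedy sample
# passes all tests (Pass@1). Sandboxing/timeouts are intentionally omitted here.
import unittest


def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Return True if the candidate solution passes every test in its suite."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # defines task_func
        exec(test_code, namespace)       # defines TestCases(unittest.TestCase)
    except Exception:
        return False
    if "TestCases" not in namespace:
        return False

    suite = unittest.TestLoader().loadTestsFromTestCase(namespace["TestCases"])
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()


def pass_at_1(results: list[bool]) -> float:
    """Greedy-decoding Pass@1: fraction of tasks whose single sample passes."""
    return sum(results) / len(results)
```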
**Findings:**
- LLMs struggle with complex instructions and function calls, with scores significantly lower than human performance.
- Instruction-tuned LLMs show some improvement but still omit essential details.
- LLMs are sensitive to the verbosity of programming instructions: they perform worse on the shorter natural-language instructions of BigCodeBench-Instruct than on the more verbose, structured docstrings of BigCodeBench-Complete.
**Future Work:**
- The authors plan to address limitations such as multilingualism, reliability, efficiency, and generalization.
- They aim to release a minimal subset of BigCodeBench for easier evaluation and to develop variants for out-of-distribution tasks and interactive environments.
**Conclusion:**
BigCodeBench provides a comprehensive and challenging benchmark for evaluating LLMs' capabilities in real-world programming tasks. The findings highlight the need for further research to improve LLMs' performance in handling complex and diverse programming scenarios.