NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts


7 May 2024 | Shudan Zhang, Hanlin Zhao, Xiao Liu, Qinkai Zheng, Zehan Qi, Xiaotao Gu, Xiaohan Zhang, Yuxiao Dong, Jie Tang
NaturalCodeBench (NCB) is a code benchmark designed to reflect the complexity and diversity of real-world coding tasks. It consists of 402 high-quality problems in Python and Java, selected from natural user queries submitted to online coding services across six domains. NCB addresses the limitations of existing benchmarks such as HumanEval, which focus on introductory algorithmic tasks and lack real-world complexity. Its problems involve diverse data types, including files, lists, dictionaries, and tensors, and are evaluated in an executable Docker environment. A semi-automated pipeline is introduced to construct test cases efficiently, substantially accelerating benchmark construction.

Evaluating 39 LLMs on NCB reveals large performance gaps between models with similar HumanEval scores, indicating insufficient attention to practical coding scenarios. Even the top-performing GPT-4 achieves only a 53% pass rate on NCB, underscoring how much room remains for improving LLMs on real-world coding tasks. NCB provides a dockerized evaluation environment and a development set of 140 problems for research use, along with a comprehensive dataset and detailed statistics. Extensive experiments show that larger models generally perform better, but refined data and training strategies are also crucial to model performance. NCB aims to offer a fair environment for comparing models and to inspire further research on complex tasks with high evaluation costs.
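To make the pass-rate style of evaluation concrete, below is a minimal, hypothetical harness that runs each candidate solution against its test cases in an isolated subprocess and reports the fraction of problems that pass. The JSONL layout and the `task_id` / `test` field names are assumptions for illustration only, not the actual NCB schema, and the real benchmark executes tests inside a Docker container rather than a bare subprocess.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(solution_code: str, test_code: str, timeout: int = 30) -> bool:
    """Write the candidate solution plus its test cases to a temp file and run it
    in a fresh Python subprocess; a zero exit code counts as a pass."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(solution_code + "\n\n" + test_code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout,
            )
            return result.returncode == 0
        except subprocess.TimeoutExpired:
            return False


def pass_rate(problems_path: str, solutions: dict[str, str]) -> float:
    """Compute the fraction of problems whose generated solution passes all tests.

    Assumes a JSONL file where each line carries `task_id` and `test` fields
    (an illustrative format, not necessarily the NCB release schema)."""
    passed = total = 0
    with open(problems_path, encoding="utf-8") as f:
        for line in f:
            problem = json.loads(line)
            total += 1
            if run_candidate(solutions.get(problem["task_id"], ""), problem["test"]):
                passed += 1
    return passed / total if total else 0.0
```

In practice, executing untrusted model-generated code inside a container (as NCB's dockerized environment does) is important for both safety and reproducibility; the subprocess sandbox here is only a lightweight stand-in.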