The paper introduces NATURALCODEBENCH (NCB), a challenging code benchmark designed to evaluate the performance of large language models (LLMs) on real-world coding tasks. NCB consists of 402 high-quality problems in Python and Java, selected from natural user queries on online coding services and covering six domains. The benchmark aims to address the limitations of existing benchmarks such as HumanEval, MBPP, and DS-1000, which focus primarily on introductory algorithmic and data-science tasks.
To make test case construction more efficient, the authors propose a semi-automated pipeline that uses GPT-4 to generate reference solutions and test cases, which are then manually corrected. This pipeline cuts construction time by more than a factor of four compared with fully manual annotation.
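For intuition, a minimal sketch of what the generation step of such a pipeline could look like is shown below, assuming one GPT-4 call per problem whose output is then handed to a human annotator. The prompt wording, model name, and helper function here are illustrative assumptions, not the authors' exact pipeline.

```python
# Illustrative sketch only: the prompt, output handling, and model name are
# assumptions, not the paper's exact pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def draft_solution_and_tests(problem: str, language: str = "python") -> str:
    """Ask GPT-4 to draft a reference solution and test cases for one problem.

    The draft is intended to be reviewed and corrected by a human annotator,
    mirroring the semi-automated construction described in the paper.
    """
    prompt = (
        f"Problem ({language}):\n{problem}\n\n"
        "Write a reference solution and a set of unit tests for it."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content  # handed off for manual correction
```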
Experiments on 39 LLMs show that models performing well on HumanEval can exhibit large performance gaps on NCB, suggesting either insufficient attention to practical coding scenarios or over-optimization for HumanEval. Even the best-performing model, GPT-4, achieves a pass rate of only about 53% on NCB, highlighting the need for further improvement.
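For context, a pass rate such as the ~53% figure is computed by executing each model-generated solution against the problem's test cases and counting the problems where all tests pass. A minimal sketch of that computation follows; it is not the released NCB toolkit, and `run_tests` is a hypothetical stand-in for sandboxed test execution.

```python
# Hedged sketch of pass-rate computation; run_tests() is a hypothetical
# stand-in for executing a candidate solution against a problem's test cases
# in a sandbox, which the released evaluation toolkit would handle.
from typing import Callable

def pass_rate(problems: list[dict],
              generate: Callable[[dict], str],
              run_tests: Callable[[dict, str], bool]) -> float:
    """Fraction of problems whose generated solution passes all test cases."""
    passed = 0
    for problem in problems:
        candidate = generate(problem)      # model-generated code for this problem
        if run_tests(problem, candidate):  # True iff every test case passes
            passed += 1
    return passed / len(problems)
```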
The paper also discusses the limitations of NCB, such as the inability to test certain types of problems and the high cost of accessing OpenAI's API. The evaluation toolkit and development set are made available to the community to facilitate future research.