The paper introduces NATURALCODEBENCH (NCB), a challenging code benchmark designed to evaluate the performance of large language models (LLMs) on real-world coding tasks. NCB consists of 402 high-quality problems in Python and Java, selected from natural user queries on online coding services and covering six domains. The benchmark aims to address the limitations of existing benchmarks such as HumanEval, MBPP, and DS-1000, which focus primarily on introductory algorithmic and data-science tasks.
To make test case construction more efficient, the authors propose a semi-automated pipeline that uses GPT-4 to generate reference solutions and test cases, which are then manually corrected. This pipeline cuts construction time by more than a factor of four compared with fully manual annotation.
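For intuition, a minimal sketch of what the generation step of such a pipeline could look like is shown below, assuming one GPT-4 call per problem whose output is then handed to a human annotator. The prompt wording, model name, and helper function here are illustrative assumptions, not the authors' exact pipeline.

```python
# Illustrative sketch only: the prompt, output handling, and model name are
# assumptions, not the paper's exact pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def draft_solution_and_tests(problem: str, language: str = "python") -> str:
    """Ask GPT-4 to draft a reference solution and test cases for one problem.

    The draft is intended to be reviewed and corrected by a human annotator,
    mirroring the semi-automated construction described in the paper.
    """
    prompt = (
        f"Problem ({language}):\n{problem}\n\n"
        "Write a reference solution and a set of unit tests for it."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content  # handed off for manual correction
```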
Experiments on 39 LLMs show that models performing well on HumanEval can exhibit large performance gaps on NCB, suggesting either insufficient attention to practical coding scenarios or over-optimization for HumanEval. Even the best-performing model, GPT-4, achieves a pass rate of only about 53% on NCB, highlighting the need for further improvement.
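For context, a pass rate such as the ~53% figure is computed by executing each model-generated solution against the problem's test cases and counting the problems where all tests pass. A minimal sketch of that computation follows; it is not the released NCB toolkit, and `run_tests` is a hypothetical stand-in for sandboxed test execution.

```python
# Hedged sketch of pass-rate computation; run_tests() is a hypothetical
# stand-in for executing a candidate solution against a problem's test cases
# in a sandbox, which the released evaluation toolkit would handle.
from typing import Callable

def pass_rate(problems: list[dict],
              generate: Callable[[dict], str],
              run_tests: Callable[[dict, str], bool]) -> float:
    """Fraction of problems whose generated solution passes all test cases."""
    passed = 0
    for problem in problems:
        candidate = generate(problem)      # model-generated code for this problem
        if run_tests(problem, candidate):  # True iff every test case passes
            passed += 1
    return passed / len(problems)
```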
The paper also discusses the limitations of NCB, such as the inability to test certain types of problems and the high cost of accessing OpenAI's API. The evaluation toolkit and development set are made available to the community to facilitate future research.