CodeEditorBench: Evaluating Code Editing Capability of Large Language Models


6 Apr 2024 | Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan, Yizhi Li, Ruibo Liu, Yue Wang, Shuyue Guo, Xingwei Qu, Xiang Yue, Ge Zhang, Wenhu Chen, Jie Fu
CodeEditorBench is an evaluation framework designed to assess the performance of Large Language Models (LLMs) on code editing tasks, including debugging, translating, polishing, and requirement switching. Unlike existing benchmarks that focus on code generation, CodeEditorBench emphasizes real-world scenarios and the practical aspects of software development. The framework curates diverse coding challenges from five sources, covering a range of programming languages, complexity levels, and editing tasks. An evaluation of 19 LLMs reveals that closed-source models, particularly Gemini-Ultra and GPT-4, outperform open-source models, and highlights differences in performance across problem types and prompt sensitivities. CodeEditorBench aims to catalyze advancements in LLMs by providing a robust platform for assessing code editing capabilities.

The dataset comprises 7,961 code editing tasks, each with an average of 44 test cases, spanning four problem types: Code Debug, Code Translate, Code Polish, and Code Requirement Switch. It is enriched with LLM-generated test cases that are verified by an Online Judge (OJ) system.

Assessment involves crafting prompts under zero-shot, three-shot, and chain-of-thought settings. Model outputs are filtered and integrated with templates for compilation, and the OJ's batch judging determines each LLM's score, ensuring a rigorous evaluation process.
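To make that pipeline concrete, here is a minimal, hypothetical sketch in Python of a task record and the prompt → filter → template → batch-judge loop described above. All names (EditTask, build_prompt, extract_code, judge.wrap_in_template, judge.run_batch, query_llm) are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of a CodeEditorBench-style evaluation flow.
# All identifiers are illustrative; they do not mirror the benchmark's real code.
from dataclasses import dataclass
from typing import Callable, List, Tuple
import re


@dataclass
class EditTask:
    problem_type: str            # "debug", "translate", "polish", or "switch"
    language: str                # e.g. "cpp", "python"
    source_code: str             # the code to be edited
    instruction: str             # task-specific requirement (target language, new spec, ...)
    test_cases: List[Tuple[str, str]]  # (stdin, expected stdout) pairs, ~44 per task on average


def build_prompt(task: EditTask, setting: str = "zero-shot") -> str:
    """Assemble a prompt for the given setting: 'zero-shot', 'three-shot', or 'cot'."""
    header = f"You are given a {task.problem_type} task in {task.language}.\n"
    if setting == "cot":
        header += "Think step by step before writing the final code.\n"
    # Few-shot exemplars would be prepended here for the 'three-shot' setting.
    return f"{header}{task.instruction}\n```{task.language}\n{task.source_code}\n```"


def extract_code(llm_output: str) -> str:
    """Filter the raw model output down to the first fenced code block, if present."""
    match = re.search(r"```[\w+]*\n(.*?)```", llm_output, re.DOTALL)
    return match.group(1) if match else llm_output


def evaluate(tasks: List[EditTask], query_llm: Callable[[str], str], judge) -> float:
    """Score one model: prompt, filter the output, wrap it in a template, batch-judge."""
    solved = 0
    for task in tasks:
        prompt = build_prompt(task, setting="zero-shot")
        code = extract_code(query_llm(prompt))
        program = judge.wrap_in_template(code, task.language)  # make it compilable
        verdicts = judge.run_batch(program, task.test_cases)   # pass/fail per test case
        solved += all(verdicts)
    return solved / len(tasks)
```

In this reading, a task is solved only if the edited program passes every test case, and the per-model score is the fraction of solved tasks; the actual benchmark's scoring and templates may differ.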
The results show that closed-source models perform better on CodeEditorBench_Plus, with GPT-4 excelling in the Debug, Translate, and Switch categories and Gemini-Ultra performing well on Polish tasks. Open-source models such as OpenCI-DS-33B also show strong performance. The analysis highlights the challenges posed by CodeEditorBench and points to areas for further research in modern software development.

The benchmark is designed to be dynamic and scalable, with periodic updates to incorporate new problems, scenarios, and models. The study also emphasizes ethical considerations in evaluating LLMs for code editing, including fairness, inclusivity, and transparency. Its stated limitations include model coverage, task selection bias, evaluation metrics, real-world relevance, and the dynamic nature of LLMs. Overall, the findings contribute to the advancement of LLMs in code editing and provide a valuable resource for researchers and practitioners.