**CodeEditorBench: Evaluating Code Editing Capability of Large Language Models**
**Introduction:**
The code editing capabilities of Large Language Models (LLMs) are evolving rapidly and are crucial for software development. CodeEditorBench is an evaluation framework designed to rigorously assess LLMs on code editing tasks, including debugging, translating, polishing, and requirement switching. Unlike existing benchmarks that focus solely on code generation, CodeEditorBench emphasizes real-world scenarios and practical aspects of software development.
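To make the four editing categories concrete, the sketch below pairs each with a toy task. The class and field names are illustrative assumptions for exposition only, not the benchmark's actual dataset schema.

```python
# Illustrative sketch only: the class and field names are assumptions,
# not the actual CodeEditorBench dataset schema.
from dataclasses import dataclass

@dataclass
class EditingTask:
    category: str     # one of: debug, translate, polish, switch
    source_code: str  # code to be edited
    instruction: str  # natural-language description of the required edit

examples = [
    EditingTask("debug", "def add(a, b): return a - b",
                "Fix the bug so the function returns the sum of a and b."),
    EditingTask("translate", "def add(a, b): return a + b",
                "Translate this Python function into C++."),
    EditingTask("polish", "def add(a, b): return a + b",
                "Improve the time or memory efficiency of this code."),
    EditingTask("switch", "def add(a, b): return a + b",
                "Modify the function to add three numbers instead of two."),
]
```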
**Methodology:**
CodeEditorBench curates diverse coding challenges and scenarios from five sources, covering a range of programming languages, complexity levels, and editing tasks. The dataset is enriched with LLM-generated test cases that are verified by an Online Judge System (OJ). Problems are constructed for each of the four editing categories, and the evaluation crafts prompts under zero-shot, three-shot, and chain-of-thought settings. Model outputs are filtered and merged into code templates for compilation, and the OJ's batch judging determines each LLM's score.
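The following is a minimal sketch of this evaluation flow under stated assumptions: every function here (`build_prompt`, `extract_code`, `submit_to_oj`) is a hypothetical placeholder, not the actual CodeEditorBench or Online Judge API.

```python
# Minimal sketch of the evaluation loop described above; all helpers are
# placeholders, not the actual CodeEditorBench or Online Judge interfaces.
from typing import Callable, List

PROMPT_SETTINGS = ["zero_shot", "three_shot", "chain_of_thought"]

def build_prompt(task: dict, setting: str) -> str:
    """Craft a prompt for the task under one of the three settings."""
    examples = "" if setting == "zero_shot" else "<few-shot or CoT examples>\n"
    return f"{examples}{task['instruction']}\n{task['source_code']}"

def extract_code(raw_output: str) -> str:
    """Filter the model output down to the code to be judged (placeholder)."""
    return raw_output.strip()

def submit_to_oj(program: str, tests: list) -> str:
    """Stand-in for batch submission to the Online Judge; returns a verdict."""
    return "Accepted"  # placeholder verdict

def evaluate(model: Callable[[str], str], tasks: List[dict], template: str) -> float:
    """Run every task under every prompt setting and report the pass rate."""
    passed = 0
    for task in tasks:
        for setting in PROMPT_SETTINGS:
            prompt = build_prompt(task, setting)      # craft the prompt
            code = extract_code(model(prompt))        # filter the model output
            program = template.format(solution=code)  # merge into code template
            passed += submit_to_oj(program, task["tests"]) == "Accepted"
    return passed / (len(tasks) * len(PROMPT_SETTINGS))
```

A harness in this spirit would report per-category pass rates so that debugging, translating, polishing, and requirement switching can be compared directly across models.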
**Results:**
The evaluation of 19 LLMs reveals that closed-source models, particularly Gemini-Ultra and GPT-4, outperform open-source models in CodeEditorBench. GPT-4 excels in three out of four areas, while Gemini-Ultra performs well in few-shot scenarios. The analysis highlights the variability in model performance based on problem category and scenario, with smaller models sometimes surpassing larger ones in efficiency.
**Conclusion:**
CodeEditorBench aims to catalyze advances in LLMs for code editing by providing a robust platform for assessment, and it will be periodically updated with new problems, scenarios, and models. The findings are intended as a resource for both researchers and practitioners working on code editing with LLMs.
**Ethics and Limitations:**
The research prioritizes ethical considerations, ensuring fairness, inclusivity, and transparency. However, limitations include the need for more inclusive model coverage, potential bias in task selection, and the dynamic nature of LLM technologies, which may render findings obsolete over time.