EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories


2024-03-31 | Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, Zhi Jin
EvoCodeBench is a code generation benchmark designed to evaluate the coding abilities of Large Language Models (LLMs) in real-world code repositories. It addresses the shortcomings of existing benchmarks by aligning with real-world code distributions, providing comprehensive annotations, and avoiding data leakage through an evolving update pipeline. The first version, EvoCodeBench-2403, contains 275 samples from 25 real-world repositories, each annotated with the requirement, reference code, reference dependencies, and test cases.

The benchmark targets repository-level code generation, simulating how developers write code within an existing repository, and evaluates generated programs with two metrics: Pass@k for functional correctness and Recall@k for dependency recall. Ten popular LLMs are evaluated, including gpt-4, gpt-3.5, DeepSeek Coder, StarCoder 2, CodeLLaMa, Gemma, and Qwen 1.5. All of them perform markedly worse on EvoCodeBench than on previous benchmarks, with gpt-4 achieving only 20.73% Pass@1. An analysis of failed cases highlights the limitations of existing LLMs on complex, repository-level code generation tasks and underscores the importance of context and the need for stronger reasoning and context handling.

Because EvoCodeBench is an evolving benchmark, it will be dynamically updated to continue avoiding data leakage. By providing comprehensive annotations, test cases, and evaluation metrics in a realistic repository setting, it offers a more challenging and realistic evaluation scenario than previous benchmarks and aims to facilitate further research, community analysis, and the application of code generation techniques in real-world repositories.
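To make the evaluation metrics concrete, the following is a minimal sketch of how Pass@k and a dependency-based Recall@k could be computed. It is not the benchmark's official implementation: the Pass@k estimator follows the standard unbiased formulation, while the Recall@k definition used here (the best dependency recall among k generated samples) and all function names and example values are illustrative assumptions.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased Pass@k estimator: the probability that at least one
    of k samples drawn from n generations (c of which pass the tests) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def recall_at_k(generated_deps: list[set[str]], reference_deps: set[str], k: int) -> float:
    """Sketch of dependency Recall@k (assumed definition): for each of the first
    k generations, compute the fraction of reference dependencies it invokes,
    and report the best value."""
    if not reference_deps:
        return 1.0
    best = 0.0
    for deps in generated_deps[:k]:
        best = max(best, len(deps & reference_deps) / len(reference_deps))
    return best


# Hypothetical usage: 10 generations per task, 2 of which pass the repository's tests.
print(f"Pass@1 ≈ {pass_at_k(n=10, c=2, k=1):.2f}")

# Hypothetical dependency names; in practice these would be parsed from the code.
reference = {"utils.load_config", "db.Session", "models.User"}
samples = [{"utils.load_config"}, {"utils.load_config", "models.User"}]
print(f"Recall@2 ≈ {recall_at_k(samples, reference, k=2):.2f}")
```

In this sketch, the reference dependencies stand in for the repository elements (functions, classes, variables) that the reference code invokes; the string sets above are placeholders for parsed dependency names.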