CODERAG-BENCH: Can Retrieval Augment Code Generation?


20 Jun 2024 | Zora Zhiruo Wang, Akari Asai, Xinyan Velocity Yu, Frank F. Xu, Yiqing Xie, Graham Neubig, Daniel Fried
**Abstract:** This paper introduces CODERAG-BENCH, a comprehensive benchmark for retrieval-augmented code generation (RACG). The benchmark explores the potential of RAG to improve code generation by providing external contexts such as library documentation. CODERAG-BENCH covers three categories of code generation tasks: basic programming, open-domain, and repository-level problems. It aggregates documents from five sources: competition solutions, online tutorials, library documentation, StackOverflow posts, and GitHub repositories. The study examines the effectiveness of different retrieval models and code generation models by providing contexts retrieved from one or multiple sources. While notable gains are observed in final code generation, current retrievers struggle to fetch useful contexts, especially when lexical overlap with the query is limited, and generators fail to improve given limited context lengths or weak abilities to integrate additional contexts.

**Introduction:** The task of generating code from natural language descriptions has advanced significantly with language models (LMs). However, many programs remain challenging for LMs to generate using their parametric knowledge alone. Retrieval-augmented generation (RAG) retrieves and incorporates relevant documents at inference time, reducing the need to store all required knowledge within model parameters. While RAG has shown success in various text-oriented tasks, its potential for improving code generation remains under-explored. CODERAG-BENCH addresses this gap by providing a unified benchmark for RACG systems, encompassing diverse coding tasks and retrieval sources.
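To make the RACG setup concrete, below is a minimal sketch of a retrieval-augmented code generation loop: retrieve the top-k documents for a coding query, prepend them to the prompt, and ask a code LM to complete the task. The toy corpus, the `generate` placeholder, and the prompt format are illustrative assumptions rather than the paper's exact implementation; the retriever here is BM25 via the `rank_bm25` package, whereas the paper also evaluates dense retrievers.

```python
# Minimal RACG sketch (assumptions: toy corpus, BM25 retriever, stub generator).
from rank_bm25 import BM25Okapi

# Toy retrieval corpus; in CODERAG-BENCH the sources are competition solutions,
# online tutorials, library documentation, StackOverflow posts, and GitHub repos.
corpus = [
    "numpy.argsort returns the indices that would sort an array.",
    "pandas.DataFrame.sort_values sorts a DataFrame by one or more columns.",
    "itertools.groupby groups consecutive elements by a key function.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def retrieve(query: str, k: int = 5) -> list[str]:
    """Return the top-k corpus documents for a natural-language coding query."""
    return bm25.get_top_n(query.lower().split(), corpus, n=k)

def build_prompt(task: str, docs: list[str]) -> str:
    """Prepend retrieved documents to the task description as code comments."""
    context = "\n".join(f"# Reference {i + 1}: {d}" for i, d in enumerate(docs))
    return f"{context}\n# Task: {task}\n# Solution:\n"

def generate(prompt: str) -> str:
    """Placeholder for any code LM (API call or local model)."""
    raise NotImplementedError

task = "Sort the rows of a DataFrame by a column and return the top three."
prompt = build_prompt(task, retrieve(task, k=2))
# completion = generate(prompt)  # the generator is left abstract in this sketch
```

The same loop applies unchanged when swapping in a stronger dense retriever or a different document source; only `retrieve` and `generate` change.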
**Key Findings:**
- **Retrieval Models:** Current retrieval models struggle to find accurate and helpful documents, especially for open-domain tasks.
- **Code Generation Models:** Many models have limited context capacity and limited RAG abilities, leading to suboptimal RACG results.
- **Efficiency:** Larger retrieval models often outperform smaller ones but incur significant costs in encoding latency and index storage.
- **Context Lengths:** Providing five retrieved documents yields the best results in most settings; the exception is StarCoder2 on RepoEval, which benefits from eight documents (see the context-packing sketch below).
- **Open Retrieval:** RACG with open retrieval from all sources can improve the performance of weaker models, but stronger models may struggle to make use of the additional contexts.

**Conclusion:** CODERAG-BENCH serves as a solid testbed for advancing RACG systems. While retrieving external documents can greatly benefit code generation, current models still struggle to select accurate contexts and to integrate them effectively. Further research is needed to improve both the efficiency and the effectiveness of RAG in code generation.
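As a rough illustration of the context-length finding above, the sketch below packs top-ranked retrieved documents into a fixed token budget before prompt construction. The whitespace-based token count and the budget value are simplifying assumptions; a real system would use the generator's own tokenizer and context window.

```python
# Pack retrieved documents (in retrieval order) under a fixed context budget.
# Whitespace tokenization and budget=2048 are illustrative simplifications.
def count_tokens(text: str) -> int:
    return len(text.split())

def pack_context(docs: list[str], budget: int = 2048) -> list[str]:
    """Keep the highest-ranked documents that fit within the budget."""
    packed, used = [], 0
    for doc in docs:
        cost = count_tokens(doc)
        if used + cost > budget:
            break  # drop lower-ranked documents that would overflow the window
        packed.append(doc)
        used += cost
    return packed
```

Under such a scheme, the number of retrieved documents trades retrieval recall against the room left for the task itself, which is consistent with the finding that five documents work best in most settings while StarCoder2 on RepoEval benefits from eight.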