20 Jun 2024 | Zora Zhiruo Wang, Akari Asai, Xinyan Velocity Yu, Frank F. Xu, Yiqing Xie, Graham Neubig, Daniel Fried
CODERAG-BENCH: Can Retrieval Augment Code Generation?
This paper introduces CODERAG-BENCH, a comprehensive benchmark for evaluating retrieval-augmented code generation (RACG). The benchmark covers three categories of code generation tasks: basic programming, open-domain, and repository-level problems. It aggregates documents from five sources for models to retrieve context from: competition solutions, online tutorials, library documentation, StackOverflow posts, and GitHub repositories. In total, the benchmark contains 9k coding problems and 25M retrieval documents, providing a unified, reproducible setting for RACG experiments.
The paper explores the effectiveness of retrieval-augmented code generation by analyzing the performance of top-performing retrieval and generation models on CODERAG-BENCH. It finds that retrieval-augmented generation improves code generation in many scenarios, but current retrievers still struggle to fetch useful contexts, especially when there is limited lexical overlap with the query, and generators often fail to benefit because of limited context lengths or a limited ability to integrate the additional contexts.
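To make the RACG setup concrete, here is a minimal retrieve-then-generate sketch: embed the problem description, fetch the top-k most similar documents from a datastore, and prepend them to the generation prompt. The dense retriever, the toy corpus, and the prompt template below are illustrative assumptions, not the specific models or formats benchmarked in the paper.

```python
# Minimal retrieve-then-generate sketch (illustrative, not the paper's exact pipeline).
# Assumes sentence-transformers is installed; the corpus and prompt format are placeholders.
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Toy datastore standing in for the benchmark's 25M documents
# (tutorials, library docs, StackOverflow posts, GitHub code, ...).
corpus = [
    "numpy.argsort returns the indices that would sort an array.",
    "Use collections.Counter to count hashable items in an iterable.",
    "pandas.DataFrame.groupby splits a frame into groups for aggregation.",
]
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the top-k documents by cosine similarity to the query."""
    query_emb = retriever.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, corpus_emb)[0]
    top_idx = scores.topk(k).indices.tolist()
    return [corpus[i] for i in top_idx]

def build_prompt(problem: str, contexts: list[str]) -> str:
    """Prepend retrieved contexts to the coding problem, RACG-style."""
    context_block = "\n\n".join(f"# Context:\n{c}" for c in contexts)
    return f"{context_block}\n\n# Task:\n{problem}\n# Solution:\n"

problem = "Write a function that returns the k most common words in a string."
prompt = build_prompt(problem, retrieve(problem))
print(prompt)  # Feed this prompt to any code generator of choice.
```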
The paper also explores RACG with open retrieval, i.e., retrieving documents from various sources with different chunking strategies. It finds that each type of coding task can benefit from functionally relevant snippets from certain sources, and chunking documents to 200–800 tokens often gives the best results.
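As a rough illustration of the chunking result, the snippet below splits a long document into fixed-size token windows. The tiktoken encoding, the 500-token chunk size, and the 50-token overlap are arbitrary choices within the 200–800-token range the paper reports as effective, not the paper's exact preprocessing.

```python
# Fixed-size token chunking sketch (an assumption, not the paper's preprocessing).
import tiktoken

def chunk_document(text: str, chunk_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Split text into ~chunk_tokens-sized windows with a small overlap."""
    enc = tiktoken.get_encoding("cl100k_base")  # example tokenizer choice
    tokens = enc.encode(text)
    step = chunk_tokens - overlap
    return [enc.decode(tokens[i:i + chunk_tokens]) for i in range(0, len(tokens), step)]

# Stand-in for a long library-documentation page.
doc = "numpy.argsort returns the indices that would sort an array. " * 300
chunks = chunk_document(doc)
print(f"{len(chunks)} chunks of roughly 500 tokens each")
```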
The paper concludes that CODERAG-BENCH serves as an effective testbed to encourage further development of advanced code-oriented RAG methods. It highlights the need for better retrieval models and more effective chunking strategies to improve RACG performance, and it discusses the challenges of applying RACG with stronger generation models, as well as the importance of context length and retrieval quality for effective code generation.