10 Jun 2024 | Jiawei Liu, Jia Le Tian, Vijay Daita, Yuxiang Wei, Yifeng Ding, Yuhan Katherine Wang, Jun Yang, Lingming Zhang
The paper introduces RepoQA, a benchmark designed to evaluate the long-context code understanding capabilities of Large Language Models (LLMs). Traditional evaluations, such as the *Needle in a Haystack* (NIAH) benchmark, focus on general, synthetic retrieval tasks and overlook the specific challenges of long-context code, particularly code spread across real repositories. RepoQA addresses this gap with the *Searching Needle Function* (SNF) task, which tests an LLM's ability to retrieve a function from a long code context given only its natural-language description. The benchmark comprises 500 code search tasks drawn from 50 popular repositories across 5 programming languages, making it the first multilingual, comprehensive benchmark for long-context code understanding.
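To make the SNF setup concrete, the sketch below shows one way such a test instance could be assembled: a long slice of repository code followed by an instruction containing only the needle function's description. The prompt wording, file packing, and the `build_snf_prompt` helper are illustrative assumptions, not the paper's actual template.

```python
# Hypothetical sketch of assembling a single Searching Needle Function (SNF)
# test; RepoQA's real prompt template and context-packing logic may differ.
def build_snf_prompt(repo_files: dict[str, str], description: str) -> str:
    """Concatenate repository files into one long context, then ask the model
    to reproduce the single function matching a natural-language description."""
    context = "\n\n".join(
        f"# File: {path}\n{code}" for path, code in repo_files.items()
    )
    instruction = (
        "Given the repository code above, output verbatim the one function "
        f"that best matches this description:\n{description}"
    )
    return f"{context}\n\n{instruction}\n"


# Illustrative usage with made-up file contents and a made-up description.
prompt = build_snf_prompt(
    {"utils.py": "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n"},
    "Restrict a numeric value to an inclusive lower and upper bound.",
)
```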
The paper outlines the design of RepoQA, covering data curation and model evaluation. Data curation involves selecting high-quality repositories and identifying needle functions, which are then annotated with natural-language descriptions. Model evaluation uses a pipeline that constructs long-context tests and measures how accurately LLMs retrieve the correct functions (a simplified sketch of this scoring step follows the list below). The evaluation covers 33 models and reveals several insights:
- There is only a small gap between the best open-source and the best proprietary models.
- Different models perform better in different programming languages.
- Models may understand code better without comments.
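As referenced above, the following is a minimal sketch of how retrieval accuracy could be scored: a retrieval counts as correct when the model's output closely matches the ground-truth needle function. The `SequenceMatcher`-based similarity and the 0.8 threshold are assumptions for illustration, not necessarily the paper's exact matching criterion.

```python
from difflib import SequenceMatcher

# Sketch of the scoring step under an assumed similarity criterion; the
# paper's exact matching rule and threshold may differ.
def is_correct_retrieval(model_output: str, needle_function: str,
                         threshold: float = 0.8) -> bool:
    """Treat a retrieval as correct if the answer closely matches the needle."""
    ratio = SequenceMatcher(
        None, model_output.strip(), needle_function.strip()
    ).ratio()
    return ratio >= threshold


def snf_accuracy(outputs: list[str], needles: list[str]) -> float:
    """Fraction of SNF tasks where the needle function was recovered."""
    hits = sum(is_correct_retrieval(o, n) for o, n in zip(outputs, needles))
    return hits / len(needles)
```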
The paper also examines the impact of code comments on retrieval accuracy, finding that removing comments improves performance for most models. It additionally explores how difficulty varies across programming languages, noting that models generally perform best in Java and TypeScript, followed by Python, C++, and Rust.
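A rough, Python-only approximation of the "code without comments" setting is sketched below: it blanks out `#` comments using the standard `tokenize` module. The paper covers five languages, and its exact preprocessing (e.g., docstring handling) is not reproduced here.

```python
import io
import tokenize

# Rough approximation of stripping comments from Python source; the paper's
# preprocessing spans five languages and may handle more cases than this.
def strip_hash_comments(source: str) -> str:
    """Blank out '#' comments while leaving all other code text intact."""
    lines = source.splitlines(keepends=True)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            row, col = tok.start  # rows are 1-indexed, columns 0-indexed
            line = lines[row - 1]
            newline = "\n" if line.endswith("\n") else ""
            lines[row - 1] = line[:col].rstrip() + newline
    return "".join(lines)


# Example: the comment disappears, the code is unchanged.
print(strip_hash_comments("x = 1  # the answer, roughly\ny = x + 1\n"))
```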
Finally, the paper concludes by highlighting the potential of RepoQA to advance the field of long-context code understanding and outlines future work, including expanding the SNF task and constructing more complex tasks.