10 Jun 2024 | Jiawei Liu, Jia Le Tian, Vijay Daita, Yuxiang Wei, Yifeng Ding, Yuhan Katherine Wang, Jun Yang, Lingming Zhang
RepoQA is a benchmark designed to evaluate the long-context code understanding ability of Large Language Models (LLMs). It introduces the Searching Needle Function (SNF) task, which requires an LLM to locate a function in a repository based on a natural-language description of it. The benchmark comprises 500 code search tasks drawn from 50 popular repositories across 5 programming languages. Evaluating 33 LLMs, RepoQA shows that proprietary models still outperform open-source ones, that performance varies by programming language and model size, and that models may understand code slightly better when comments are stripped, suggesting LLMs do not depend on explicit documentation. Because it is multilingual and built from real repositories, RepoQA covers a wide range of code search scenarios and provides a standardized way to evaluate long-context code understanding, filling a gap left by existing benchmarks that focus on general or synthetic tasks. The results highlight the importance of context length and the need for more complex tasks to fully assess LLM capabilities in code understanding. Future work includes expanding the SNF task and creating more complex code-related evaluations.
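
To make the SNF setup concrete, here is a minimal sketch of how such a task could be posed and scored: a long repository context plus a natural-language description is assembled into a prompt, and the model's answer is compared to the ground-truth "needle" function by textual similarity. The prompt wording, the `difflib`-based similarity check, the threshold, and the helper names are illustrative assumptions, not RepoQA's actual evaluation harness.

```python
import difflib

# Hypothetical ground-truth "needle" function (illustrative only, not from the RepoQA dataset).
NEEDLE_FUNCTION = '''\
def parse_config(path):
    """Load a JSON config file and return it as a dict."""
    import json
    with open(path) as f:
        return json.load(f)
'''


def build_snf_prompt(repo_context: str, description: str) -> str:
    """Assemble an SNF-style prompt: a long repository context followed by a
    natural-language description of the target function."""
    return (
        f"{repo_context}\n\n"
        "Based on the description below, copy the matching function "
        "from the code above verbatim.\n"
        f"Description: {description}\n"
    )


def score_retrieval(model_output: str, needle: str, threshold: float = 0.8) -> bool:
    """Score the model's answer by textual similarity to the ground-truth needle.
    (A simple difflib ratio stands in for the benchmark's real matching logic.)"""
    similarity = difflib.SequenceMatcher(
        None, model_output.strip(), needle.strip()
    ).ratio()
    return similarity >= threshold


if __name__ == "__main__":
    prompt = build_snf_prompt(
        repo_context="<thousands of lines of repository code>",
        description="Loads a JSON config file and returns it as a dict.",
    )
    # Pretend the model copied the needle perfectly; in practice this would be
    # the text returned by the LLM under evaluation.
    fake_model_output = NEEDLE_FUNCTION
    print("Retrieved correctly:", score_retrieval(fake_model_output, NEEDLE_FUNCTION))
```

In this framing, accuracy over the 500 tasks is simply the fraction of descriptions for which the retrieved function clears the similarity threshold, which is why the task isolates long-context retrieval and comprehension rather than code generation.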