Long Code Arena: a Set of Benchmarks for Long-Context Code Models

17 Jun 2024 | Egor Bogomolov, Aleksandra Eliseeva, Timur Galimzyanov, Evgeniy Glukhov, Anton Shapkin, Maria Tigina, Yaroslav Golubev, Alexander Kovrigin, Arie van Deursen, Maliheh Izadi, Timofey Bryksin
Long Code Arena is a suite of six benchmarks for evaluating long-context code models, covering library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization. Each task ships with a manually verified dataset, an evaluation suite, and open-source baseline solutions built on popular large language models (LLMs). The tasks are designed so that models must draw on information from an entire project module or even the whole project, and the datasets are sourced from open-source GitHub repositories with permissive licenses and carefully curated to ensure data quality. The benchmarks are published on HuggingFace Spaces with a leaderboard, links to the datasets on HuggingFace Hub, and a GitHub repository containing the baselines, which are provided to aid future research. The work addresses the lack of code-processing benchmarks that go beyond a single file of context and aims to provide a comprehensive evaluation of models on realistic software engineering tasks; the paper also discusses related work, limitations, and future directions for research in ML4SE and NLP.
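
Since the datasets are published on HuggingFace Hub, they can presumably be loaded with the standard `datasets` library. The snippet below is a minimal sketch only: the repository ID and split name are assumptions for illustration and may differ from the actual identifiers listed on the Long Code Arena HuggingFace Spaces page.

```python
from datasets import load_dataset

# Hypothetical repository ID for the commit message generation task;
# the real dataset names are linked from the Long Code Arena leaderboard
# on HuggingFace Spaces.
ds = load_dataset(
    "JetBrains-Research/lca-commit-message-generation",  # assumed repo ID
    split="test",  # assumed split name
)

# Inspect one example; the field names depend on the task-specific schema.
print(len(ds))
print(ds[0].keys())
```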