CoIR is a comprehensive benchmark for code information retrieval, designed to evaluate the effectiveness of retrieval models across diverse domains and tasks. It comprises ten meticulously curated code datasets spanning eight distinct retrieval tasks across seven domains, addressing the limitations of existing code retrieval benchmarks with a rich, standardized evaluation framework. CoIR covers 14 main programming languages, with corpus sizes ranging from 1K to 1M entries.

The benchmark is designed to be user-friendly: its Python framework installs via pip and is compatible with existing benchmarks such as MTEB and BEIR, enabling seamless cross-benchmark comparisons. The eight tasks fall into four main categories (Text-to-Code Retrieval, Code-to-Code Retrieval, Code-to-Text Retrieval, and Hybrid Code Retrieval), each further broken down into sub-tasks. CoIR also provides a comprehensive dataset statistics table and detailed information on how each dataset was prepared.

The benchmark evaluates nine widely used retrieval models using NDCG@10, MAP, Recall, and Precision. The results show that even the best models perform suboptimally on CoIR, highlighting the complexity and challenges of code retrieval, and even state-of-the-art systems face significant difficulty on these tasks. The benchmark also considers the trade-off between accuracy and latency, as well as the impact of input length on model performance. By providing a versatile, standardized benchmarking tool, CoIR aims to stimulate advances in code retrieval and encourage further development and exploration of code retrieval systems.
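To make the Text-to-Code Retrieval setting concrete, here is a minimal, self-contained sketch of the task: given a natural-language query, rank code snippets in a corpus by similarity and return the best match. This toy example uses a bag-of-words cosine similarity purely for illustration; the snippet names and two-entry corpus are hypothetical, and real CoIR systems use learned dense or sparse retrievers over much larger corpora.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words vector: lowercase alphabetic tokens -> counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical two-snippet corpus; actual CoIR corpora range from 1K to 1M entries.
corpus = {
    "snippet_1": "def add(a, b): return a + b",
    "snippet_2": "def reverse_string(s): return s[::-1]",
}

query = "reverse a string"
ranked = sorted(corpus, key=lambda d: cosine(embed(query), embed(corpus[d])), reverse=True)
print(ranked[0])  # → snippet_2
```

The other task categories follow the same retrieve-and-rank shape, varying only what serves as query and corpus (code-to-code, code-to-text, or hybrid inputs).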
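The primary metric above, NDCG@10, rewards rankings that place relevant results near the top, with a logarithmic discount by rank. A minimal sketch of how it is computed (standard definition, not CoIR's specific evaluation code):

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the given ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# A ranking that places the only relevant document at rank 2
# scores 1/log2(3) relative to the ideal ranking.
print(ndcg_at_k([0, 1, 0, 0], k=10))  # ≈ 0.631
```

MAP, Recall, and Precision are computed analogously from the same per-query ranked lists, so a single retrieval run suffices to report all four metrics.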