ReposVul: A Repository-Level High-Quality Vulnerability Dataset

8 Feb 2024 | Xinchen Wang*, Ruida Hu*, Cuiyun Gao*, Xin-Cheng Wen, Yujia Chen, Qing Liao
**Authors:** Xinchen Wang, Xin-Cheng Wen, Ruida Hu, Yujia Chen, Cuiyun Gao, Qing Liao

**Abstract:** Open-Source Software (OSS) vulnerabilities pose significant security risks to society. Deep learning (DL)-based approaches have proven effective in automated vulnerability detection, but their performance relies heavily on the quality and quantity of labeled data. Existing datasets suffer from issues such as tangled patches, lack of inter-procedural vulnerabilities, and outdated patches. To address these limitations, the authors propose an automated data collection framework and construct the first repository-level high-quality vulnerability dataset, named ReposVul. The framework includes three modules: a vulnerability untangling module to distinguish vulnerability-fixing related code changes from tangled patches, a multi-granularity dependency extraction module to capture inter-procedural call relationships, and a trace-based filtering module to identify outdated patches. ReposVul covers 6,134 CVE entries across 1,491 projects and four programming languages, providing detailed multi-granularity patch information. Data analysis and manual checking demonstrate that ReposVul is high in quality and addresses the limitations of existing datasets.

**Contributions:**

1. An automated data collection framework for obtaining vulnerability data.
2. ReposVul, the first repository-level vulnerability dataset with multi-granularity information.
3. Manual checking and data analysis showing that ReposVul is high in quality and alleviates the limitations of existing datasets.

**Keywords:** Open-Source Software, Software Vulnerability Datasets, Data Quality

**Introduction:** The paper discusses the challenges of OSS vulnerabilities and the importance of high-quality datasets for effective vulnerability detection. It outlines the limitations of existing datasets and introduces the ReposVul framework to address these issues. The framework includes a vulnerability untangling module, a multi-granularity dependency extraction module, and a trace-based filtering module.

**Evaluation and Experimental Results:** The paper evaluates ReposVul's advantages over existing datasets, the quality of its data labels, the impact of different LLMs and prompt design on label quality, and its performance in filtering outdated patches. The results show that ReposVul outperforms existing datasets in label quality and provides rich additional information.

**Discussion:** The paper discusses the application of ReposVul in multi-granularity vulnerability detection, patch management, and vulnerability repair. It also addresses threats and limitations, such as the source platforms used for data collection and the programming languages covered.

**Conclusion:** The paper concludes by highlighting the significance of ReposVul in promoting standardized and practical evaluation of model performance in OSS vulnerability detection. The source code and ReposVul dataset are available for public use.
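
**Illustrative sketch:** The three-module framework named in the abstract can be pictured as a small pipeline. The Python below is only a rough sketch under stated assumptions, not the authors' implementation: the `Patch` and `FileChange` types, the function names, the `is_fix_related` callback (a stand-in for whatever judgement the untangling module makes), the dictionary call graph, and the rule "a patch is outdated if a later patch touches the same file" are all hypothetical simplifications. The CVE identifiers and file paths are made up.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable, Dict, List, Set


@dataclass
class FileChange:
    path: str
    changed_functions: List[str]
    diff: str


@dataclass
class Patch:
    cve_id: str
    commit_date: date
    changes: List[FileChange]
    outdated: bool = False
    context: Dict[str, Set[str]] = field(default_factory=dict)


def untangle(patch: Patch, is_fix_related: Callable[[FileChange], bool]) -> List[FileChange]:
    """Module 1 stand-in: keep only the changes judged to be fix-related."""
    return [change for change in patch.changes if is_fix_related(change)]


def extract_dependencies(changes: List[FileChange],
                         call_graph: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Module 2 stand-in: collect callers and callees of each changed function."""
    context: Dict[str, Set[str]] = {}
    for change in changes:
        for fn in change.changed_functions:
            callees = call_graph.get(fn, set())
            callers = {src for src, dsts in call_graph.items() if fn in dsts}
            context[fn] = callees | callers
    return context


def mark_outdated(patches: List[Patch]) -> None:
    """Module 3 stand-in (assumed rule): a patch is outdated if a later
    patch in the same collection touches one of its files."""
    latest: Dict[str, date] = {}
    for patch in patches:
        for change in patch.changes:
            if change.path not in latest or patch.commit_date > latest[change.path]:
                latest[change.path] = patch.commit_date
    for patch in patches:
        patch.outdated = any(latest[c.path] > patch.commit_date for c in patch.changes)


# Made-up example data: caller -> set of callees.
call_graph = {"parse": {"read_len"}, "main": {"parse"}}
old = Patch("CVE-0000-0001", date(2020, 1, 1), [
    FileChange("src/parser.c", ["parse"], "+ if (len > MAX) return -1;"),
    FileChange("README.md", [], "+ typo fix"),  # unrelated change tangled into the patch
])
new = Patch("CVE-0000-0002", date(2021, 6, 1), [
    FileChange("src/parser.c", ["parse"], "+ if (len < 0) return -1;"),
])

old.changes = untangle(old, lambda c: c.path.endswith(".c"))
old.context = extract_dependencies(old.changes, call_graph)
mark_outdated([old, new])
```

In this toy run the README edit is dropped from the older patch, `parse` is annotated with its caller `main` and callee `read_len`, and the older patch is flagged as outdated because the newer one modifies the same file. A real pipeline would replace each stand-in with the corresponding analysis described in the paper.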