ReposVul: A Repository-Level High-Quality Vulnerability Dataset

2024 | Xinchen Wang*, Ruida Hu*, Cuiyun Gao*, Xin-Cheng Wen, Yujia Chen, Qing Liao
This paper introduces ReposVul, which the authors present as the first repository-level high-quality vulnerability dataset. Existing vulnerability datasets suffer from three problems: tangled patches, a lack of inter-procedural vulnerabilities, and outdated patches. To address them, the authors propose an automated data collection framework with three modules: (1) a vulnerability untangling module that uses Large Language Models (LLMs) and static analysis tools to separate vulnerability-fixing code changes from the unrelated changes in tangled patches, identifying the files actually related to the fix; (2) a multi-granularity dependency extraction module that captures inter-procedural call relationships across the repository at the repository, file, function, and line levels; and (3) a trace-based filtering module that identifies outdated patches from file path and commit time traces in the patch submission history.

ReposVul contains 6,134 CVE entries across 1,491 projects and four programming languages, with detailed multi-granularity patch information. The authors evaluate their framework and report that ReposVul outperforms existing datasets in label quality and provides additional information such as CVE descriptions, CVSS, and patch submission history.

The dataset supports DL-based vulnerability detection methods and can be used for various OSS vulnerability-related tasks, including patch management, vulnerability repair, and vulnerability detection. The authors also discuss threats and limitations, including projects on other platforms that may be missing and the limited range of programming languages covered. Future work includes expanding the dataset to more languages and more CVEs.
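
The trace-based filtering idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the summary only says that outdated patches are identified from file path and commit time traces, so the rule used here (a file-level patch counts as outdated when a later collected commit modifies the same file path) is an assumption, and the names FilePatch and flag_outdated are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FilePatch:
    """One file-level change from a fixing commit (hypothetical schema)."""
    commit_id: str
    file_path: str
    commit_time: datetime


def flag_outdated(patches: list[FilePatch]) -> dict[tuple[str, str], bool]:
    """Map (commit_id, file_path) to True when a later commit touches the same path.

    Assumed rule for illustration only; the paper's exact criterion may differ.
    """
    # Record the most recent commit time seen for each file path.
    latest: dict[str, datetime] = {}
    for p in patches:
        if p.file_path not in latest or p.commit_time > latest[p.file_path]:
            latest[p.file_path] = p.commit_time
    # A patch is flagged when it is older than the newest change to its path.
    return {
        (p.commit_id, p.file_path): p.commit_time < latest[p.file_path]
        for p in patches
    }
```

Calling flag_outdated on a list of FilePatch records returns one boolean per (commit, file) pair, so the flagged entries can be filtered out or kept with an "outdated" label, depending on how the downstream task wants to treat them.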