MegaVul: A C/C++ Vulnerability Dataset with Comprehensive Code Representations

April 15–16, 2024 | Chao Ni, Liyu Shen, Xiaohu Yang, Yan Zhu, Shaohua Wang
The paper introduces MegaVul, a large-scale and comprehensive C/C++ vulnerability dataset constructed by crawling the Common Vulnerabilities and Exposures (CVE) database and related open-source projects. The dataset includes 17,380 vulnerabilities from 922 repositories, spanning 169 different vulnerability types disclosed from January 2006 to October 2023. The authors collected descriptive information from the CVE database and extracted code changes from 28 Git-based websites, ensuring code integrity and enriching it with four transformed representations. MegaVul is publicly available on GitHub and can be used for various software security tasks, including vulnerability detection and severity assessment. The dataset addresses limitations of existing datasets by providing a rich, continuously updated, and high-quality resource for research and development in software security.
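As a rough illustration of how headline counts like those above (vulnerabilities, repositories, vulnerability types) could be derived from function-level records, here is a minimal Python sketch. The record layout is an assumption made for this example only: the field names `cve_id`, `repo`, `cwe_ids`, and `is_vul`, the placeholder CVE identifiers, and the `summarize` helper are hypothetical and are not taken from MegaVul's documented schema.

```python
import json
from collections import Counter

# Hypothetical sample records. The field names and values are illustrative
# assumptions, not MegaVul's actual schema or data.
SAMPLE = """[
  {"cve_id": "CVE-0000-0001", "repo": "example/a", "cwe_ids": ["CWE-119"], "is_vul": true},
  {"cve_id": "CVE-0000-0002", "repo": "example/b", "cwe_ids": ["CWE-119", "CWE-416"], "is_vul": true},
  {"cve_id": "CVE-0000-0001", "repo": "example/a", "cwe_ids": [], "is_vul": false}
]"""


def summarize(records):
    """Count vulnerable records, distinct repositories, and distinct CWE types."""
    vulnerable = [r for r in records if r["is_vul"]]
    repos = {r["repo"] for r in vulnerable}
    cwe_counts = Counter(cwe for r in vulnerable for cwe in r["cwe_ids"])
    return {
        "vulnerable_records": len(vulnerable),
        "repositories": len(repos),
        "cwe_types": len(cwe_counts),
        "top_cwe": cwe_counts.most_common(1),
    }


summary = summarize(json.loads(SAMPLE))
print(summary)
```

With a real export, the same function could be applied after loading the file with `json.load`, once the field names are adjusted to whatever the published data actually uses.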