[slides and audio] MegaVul: A C/C++ Vulnerability Dataset with Comprehensive Code Representations

MegaVul is a large-scale, comprehensive C/C++ vulnerability dataset built by crawling the Common Vulnerabilities and Exposures (CVE) database and the open-source projects that CVE entries reference. It contains 17,380 vulnerabilities collected from 992 open-source repositories, spanning 169 vulnerability types disclosed between January 2006 and October 2023. The dataset provides comprehensive code representations, including function signatures, abstracted functions, parsed functions, and code changes. MegaVul is publicly available on GitHub, will be continuously updated, and can be easily extended to other programming languages.

The dataset was constructed in two stages. First, the public CVE database was crawled to collect all available descriptive information for each CVE, including its severity score and the references linking to the affected products. Then, through the CVE references, the relevant products hosted on Git-based websites were identified and the vulnerability-related code commits were extracted. Advanced tools were used to ensure the integrity of the extracted code and to enrich it with four different transformed representations.

MegaVul addresses limitations of existing datasets: unrealistic vulnerabilities, unrealistic data distributions, limited diversity, few newly disclosed vulnerabilities, and low data quality. It provides a rich set of function information, including vulnerable functions, abstracted functions, and function graphs, along with descriptive information such as CWE types and CVE descriptions. Researchers can use this information to train state-of-the-art deep learning models for vulnerability detection, both sequence-based and graph-based. MegaVul supports a range of software security tasks, including in-depth analysis of vulnerability characteristics, data-driven vulnerability detection, and identification of vulnerability-fixing patches.
The dataset is also useful for training vulnerability detection models for better use in real-world production environments. The dataset is continuously updated and expanded to improve its quality and advance research in the field of security.
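
The step of going from CVE references to vulnerability-related commits can be illustrated with a minimal Python sketch. It is not MegaVul's actual crawler: the URL pattern, the two hosts it recognises, and the names `COMMIT_URL`, `CommitRef`, `parse_commit_reference`, and `commit_references` are all assumptions made for illustration. It only shows the general idea of filtering a CVE's reference URLs down to those that point at a commit on a Git hosting site.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed URL shape for commit links on two common Git hosts.
# A real crawler would need to cover more hosts and URL variants.
COMMIT_URL = re.compile(
    r"^https?://(?P<host>github\.com|gitlab\.com)"
    r"/(?P<owner>[^/]+)/(?P<repo>[^/]+)"
    r"(?:/-)?/commit/(?P<sha>[0-9a-fA-F]{7,40})"
)


class CommitRef(NamedTuple):
    """Hypothetical record for one commit referenced by a CVE."""
    host: str
    owner: str
    repo: str
    sha: str


def parse_commit_reference(url: str) -> Optional[CommitRef]:
    """Return a CommitRef if the URL looks like a commit link, else None."""
    match = COMMIT_URL.match(url.strip())
    if match is None:
        return None
    return CommitRef(
        match["host"], match["owner"], match["repo"], match["sha"].lower()
    )


def commit_references(reference_urls: Iterable[str]) -> List[CommitRef]:
    """Keep only references that point at a commit, dropping duplicates."""
    seen = set()
    result = []
    for url in reference_urls:
        ref = parse_commit_reference(url)
        if ref is not None and ref not in seen:
            seen.add(ref)
            result.append(ref)
    return result
```

Given the list of reference URLs attached to one CVE, `commit_references` returns the commits that a later step could fetch in order to extract the functions before and after the fix. References to advisories, issue trackers, or mailing lists are simply ignored in this sketch.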

MegaVul: A C/C++ Vulnerability Dataset with Comprehensive Code Representations

April 15-16, 2024 | Chao Ni, Liyu Shen, Xiaohu Yang, Yan Zhu*, Shaohua Wang