April 14-20, 2024 | Xin Zhou, Ting Zhang, and David Lo
This paper explores the effectiveness of large language models (LLMs), specifically GPT-3.5 and GPT-4, in detecting software vulnerabilities. Previous methods relied on medium-sized pre-trained models or smaller neural networks, and although recent LLMs have shown strong few-shot learning capabilities, their performance on vulnerability detection remains largely unexplored. The study investigates how GPT-3.5 and GPT-4 perform under a variety of prompts.

Experimental results show that GPT-3.5 achieves performance competitive with CodeBERT, while GPT-4 outperforms CodeBERT by 34.8% in terms of accuracy. The study also explores prompt designs to improve LLM performance on vulnerability detection, including role descriptions, project information, knowledge from external sources, and samples drawn from the training set; combining these prompt components significantly improves performance. The results further highlight the strength of GPT-3.5 in precision and of CodeBERT in recall.

The paper discusses future directions, including the development of local and specialized LLMs for vulnerability detection, improving precision and robustness, addressing the long-tailed distribution of vulnerability data, and fostering trust and collaboration between developers and AI-powered solutions. The authors conclude that LLMs show promise for vulnerability detection, but further research is needed to address their limitations and improve their effectiveness.
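The summary does not reproduce the paper's exact prompt templates, so the following is a minimal sketch, assuming a chat-style message format, of how a prompt combining the four components studied (role description, project information, external knowledge, and training-set examples) might be assembled. The function and variable names (`build_prompt`, `project_info`, `external_knowledge`, `few_shot_examples`) and the wording of the prompts are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch: assembling a vulnerability-detection prompt from the
# four prompt components described in the summary. Wording and structure are
# assumptions for illustration, not the paper's exact templates.

def build_prompt(code_snippet: str,
                 project_info: str,
                 external_knowledge: str,
                 few_shot_examples: list[tuple[str, str]]) -> list[dict]:
    """Return a chat-style message list for a binary vulnerability check."""
    # Role description: frame the model as a security reviewer.
    messages = [{
        "role": "system",
        "content": "You are an experienced security engineer who reviews "
                   "C/C++ functions and answers only 'vulnerable' or 'not vulnerable'."
    }]

    # Few-shot examples drawn from the training set, as (code, label) pairs.
    for example_code, label in few_shot_examples:
        messages.append({"role": "user", "content": f"Code:\n{example_code}"})
        messages.append({"role": "assistant", "content": label})

    # Target query: project context, external knowledge (e.g. a CWE note),
    # and the function under analysis.
    messages.append({
        "role": "user",
        "content": (
            f"Project information: {project_info}\n"
            f"Relevant background: {external_knowledge}\n"
            f"Code:\n{code_snippet}\n"
            "Is this function vulnerable?"
        )
    })
    return messages


if __name__ == "__main__":
    msgs = build_prompt(
        code_snippet="void copy(char *dst, char *src) { strcpy(dst, src); }",
        project_info="Utility function from a network-facing C service.",
        external_knowledge="CWE-120: buffer copy without checking size of input.",
        few_shot_examples=[("int add(int a, int b) { return a + b; }", "not vulnerable")],
    )
    for m in msgs:
        print(m["role"], ":", m["content"][:60])
```

The resulting message list could then be sent to any chat-completion endpoint (e.g. GPT-3.5 or GPT-4) and the reply parsed into a binary vulnerable/not-vulnerable label.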