An Empirical Evaluation of LLMs for Solving Offensive Security Challenges

19 Feb 2024 | Minghao Shao, Boyuan Chen, Sofija Jancheska, Brendan Dolan-Gavitt, Siddharth Garg, Ramesh Karri, Muhammad Shafique
This paper evaluates the effectiveness of large language models (LLMs) in solving Capture The Flag (CTF) challenges, a popular form of cybersecurity competition. The authors develop two workflows, human-in-the-loop (HITL) and fully automated, to assess LLMs' ability to solve CTF challenges. Benchmarking against results collected from human contestants, they find that LLMs achieve a higher success rate than the average human participant. The study covers six LLMs (GPT-3.5, GPT-4, Claude, Bard, DeepSeek Coder, and Mixtral) across 26 diverse CTF problems, and shows that ChatGPT performs comparably to an average human CTF team. The paper also analyzes the typical shortcomings of LLMs when tackling CTF challenges, highlighting the limitations of relying on LLMs without human intervention. The authors provide a comprehensive evaluation of LLMs' capabilities on real-world CTF challenges and offer insights for applying LLMs in cybersecurity education and for the systematic evaluation of offensive cybersecurity capabilities.
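To make the fully automated workflow concrete, here is a minimal illustrative sketch of what such a solver loop might look like. This is not the authors' implementation: the `query_llm` function, the `flag{...}` format, and the convention that the model returns one shell command in a fenced block are all assumptions for illustration.

```python
import re
import subprocess

# Assumed flag format; real competitions vary (e.g. csictf{...}, picoCTF{...}).
FLAG_RE = re.compile(r"flag\{[^}]+\}")

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; substitute your model provider's API here."""
    raise NotImplementedError

def solve_challenge(description: str, max_turns: int = 10) -> str | None:
    """Drive an LLM through a feedback loop: ask for a shell command,
    run it, feed the output back, and repeat until a flag appears."""
    transcript = (f"CTF challenge:\n{description}\n"
                  "Suggest exactly one shell command to make progress, "
                  "in a ```sh fenced block.")
    for _ in range(max_turns):
        reply = query_llm(transcript)
        if (flag := FLAG_RE.search(reply)):
            return flag.group(0)
        # Assumes the model wraps its command in a ```sh/```bash block.
        match = re.search(r"```(?:sh|bash)?\n(.+?)```", reply, re.DOTALL)
        if not match:
            transcript += "\nNo command found; give exactly one shell command."
            continue
        # NOTE: executing model-suggested commands is dangerous;
        # in practice this should run inside a sandboxed container.
        result = subprocess.run(match.group(1), shell=True,
                                capture_output=True, text=True, timeout=60)
        output = (result.stdout + result.stderr)[:2000]  # truncate long output
        if (flag := FLAG_RE.search(output)):
            return flag.group(0)
        transcript += f"\nCommand output:\n{output}\nSuggest the next command."
    return None
```

The human-in-the-loop workflow differs mainly in who closes the loop: instead of executing commands automatically, a human operator vets each suggestion, runs it, and decides what feedback to return to the model.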