19 Feb 2024 | Minghao Shao, Boyuan Chen, Sofija Jancheska, Brendan Dolan-Gavitt, Siddharth Garg, Ramesh Karri, Muhammad Shafique
This paper evaluates the effectiveness of large language models (LLMs) in solving Capture the Flag (CTF) challenges. The authors developed two workflows, human-in-the-loop (HITL) and fully automated, to assess LLMs' ability to solve CTF challenges. They collected human contestants' results on the same set of questions and found that LLMs achieved a higher success rate than the average human participant. The study provides a comprehensive evaluation of LLMs' capabilities on real-world CTF challenges, from a real competition setting to a fully automated workflow. The results offer insights for applying LLMs in cybersecurity education and pave the way for systematic evaluation of offensive cybersecurity capabilities in LLMs.
The study involved six LLMs: GPT-3.5, GPT-4, Claude, Bard, DeepSeek Coder, and Mixtral. In the part of the study involving human participants, however, all teams used ChatGPT, which performed comparably to an average human CTF team. The authors built two workflows for solving CTF questions with LLMs and reported their success rates. They also analyzed the typical shortcomings of LLMs when tackling CTF challenges, illustrating the limits of relying on LLMs alone without human intervention.
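To make the fully automated workflow concrete, here is a minimal sketch of what such a solver loop could look like. This is not the authors' implementation: `query_llm` is a hypothetical stand-in for any chat-completion API, and the flag pattern, retry budget, and prompts are illustrative assumptions.

```python
# Minimal sketch of a fully automated CTF-solving loop, in the spirit of the
# paper's automated workflow. NOT the authors' implementation: query_llm is a
# hypothetical stand-in for any chat-completion API, and the flag pattern,
# retry budget, and prompts are illustrative assumptions.
import re
import subprocess

FLAG_RE = re.compile(r"flag\{[^}]+\}")   # assumed flag format
MAX_ROUNDS = 5                           # assumed retry budget


def query_llm(messages: list[dict]) -> str:
    """Hypothetical wrapper around an LLM chat API; returns the model's reply."""
    raise NotImplementedError("plug in your provider's client here")


def solve_challenge(description: str) -> str | None:
    messages = [
        {"role": "system", "content": "You are solving a CTF challenge. "
                                      "Reply with a single shell command to run, "
                                      "or with the flag once you have found it."},
        {"role": "user", "content": description},
    ]
    for _ in range(MAX_ROUNDS):
        reply = query_llm(messages)
        if (m := FLAG_RE.search(reply)):          # model claims to have the flag
            return m.group(0)
        # Otherwise treat the reply as a command, run it, and feed the output back.
        try:
            result = subprocess.run(reply, shell=True, capture_output=True,
                                    text=True, timeout=30)
            feedback = result.stdout + result.stderr
        except subprocess.TimeoutExpired:
            feedback = "command timed out"
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": feedback})
    return None                                    # gave up within the budget
```

Even in this toy form, the loop makes the paper's failure modes concrete: with no human in the loop, a single mis-translated command or miscalculated value derails every subsequent round.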
The study found that ChatGPT solved 11 of the 21 questions in the fully automated workflow, the highest count among all LLMs, and it was the preferred choice for every team during the competition. However, it showed limitations in code translation and calculation accuracy, and a tendency to offer overly general responses. Bard solved only 2 questions correctly and was the only LLM in the experiment to return a null response. Claude solved 6 of the 21 challenges but demonstrated an accurate understanding of the problem for more than half of the challenge set. DeepSeek Coder solved one more text-only question than Claude and was able to correct its answer in its second or third output. Mixtral performed worse than the other LLMs, with the lowest success rate in solving challenges.
The study also compared the performance of LLMs against human CTF teams: in the real-world CTF competition, GPT-4 outperformed 88.5% of human players. The results show that LLMs have significant potential to perform on par with human players in CTF competitions. However, the study also highlights the importance of human expertise in solving CTF challenges, as providing human feedback to LLMs can significantly decrease failures and boost their accuracy. LLMs are valuable for helping users solve and understand CTF challenges, but they are not yet ready to replace human expertise.
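For comparison, a hedged sketch of the human-in-the-loop variant is below. It reuses the hypothetical `query_llm`, `FLAG_RE`, and `MAX_ROUNDS` helpers from the sketch above and simply lets an operator inject corrective feedback between rounds; none of this reflects the authors' actual tooling.

```python
# Hedged sketch of a human-in-the-loop variant: a human operator reviews each
# model reply and may inject corrective feedback before the next round.
# Reuses the hypothetical query_llm, FLAG_RE, and MAX_ROUNDS from the sketch
# above; not the authors' implementation.
def solve_with_human_feedback(description: str) -> str | None:
    messages = [{"role": "user", "content": description}]
    for _ in range(MAX_ROUNDS):
        reply = query_llm(messages)
        if (m := FLAG_RE.search(reply)):
            return m.group(0)                  # model produced a flag-shaped answer
        messages.append({"role": "assistant", "content": reply})
        hint = input("Correction or hint for the model (blank to skip): ")
        if hint:
            messages.append({"role": "user", "content": hint})
    return None
```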