CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence

24 Jun 2024 | Md Tanvirul Alam, Dipkamal Bhusal, Le Nguyen, Nidhi Rastogi
Cyber Threat Intelligence (CTI) is crucial for understanding and mitigating evolving cyber threats. Large Language Models (LLMs) have shown potential in this domain, but their reliability and accuracy remain concerns. Existing benchmarks evaluate LLMs on general capabilities and lack CTI-specific tasks. To address this gap, the authors introduce CTIBench, a benchmark designed to assess LLM performance in CTI applications. CTIBench comprises multiple datasets that probe LLMs' knowledge of the cyber-threat landscape, covering four fundamental cognitive capabilities: memorization, understanding, problem-solving, and reasoning. Its tasks include multiple-choice questions (CTI-MCQ), root cause mapping (CTI-RCM), vulnerability severity prediction (CTI-VSP), and threat actor attribution (CTI-TAA).

Evaluating several state-of-the-art models on these tasks reveals their strengths and weaknesses, contributing to a better understanding of LLM capabilities in CTI and suggesting areas for future research. Ultimately, LLM tooling of this kind aims to accelerate incident response by automating the triage and analysis of security alerts, reducing response time. The datasets and code are publicly available at <https://github.com/xashru/cti-bench>.
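To make the evaluation setup concrete, here is a minimal sketch of how one might score an LLM on the CTI-VSP (vulnerability severity prediction) task. The file name `cti_vsp.tsv`, its column names, and the `query_llm` helper are illustrative assumptions rather than the benchmark's actual interface; consult the repository for the released data format and evaluation scripts.

```python
"""Minimal sketch of scoring an LLM on a CTIBench-style CTI-VSP task.

Assumptions (not taken from the paper): the dataset is a TSV with
'Description' and 'GT' (ground-truth CVSS base score) columns, and
query_llm is a placeholder for whatever model API you use.
"""
import csv
import re


def query_llm(prompt: str) -> str:
    # Placeholder: substitute a real model call (OpenAI, HF pipeline, etc.).
    raise NotImplementedError


def extract_score(text: str) -> float | None:
    # Pull the first number in [0, 10] out of the model's free-text reply.
    m = re.search(r"\b(10(?:\.0)?|\d(?:\.\d)?)\b", text)
    return float(m.group(1)) if m else None


def evaluate(path: str = "cti_vsp.tsv") -> float:
    errors = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            prompt = (
                "Given the vulnerability description below, output only the "
                "CVSS v3.1 base score (0.0-10.0).\n\n" + row["Description"]
            )
            pred = extract_score(query_llm(prompt))
            if pred is not None:
                errors.append(abs(pred - float(row["GT"])))
    if not errors:
        raise ValueError("no parseable predictions")
    # Mean absolute deviation between predicted and ground-truth base
    # scores: lower is better for severity prediction.
    return sum(errors) / len(errors)
```

The other tasks follow the same prompt-then-score pattern with different metrics: CTI-MCQ and CTI-RCM reduce to exact-match accuracy against the gold option or CWE identifier, while CTI-TAA requires judging whether the attributed threat actor (or a known alias) matches the ground truth.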