CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence

24 Jun 2024 | Md Tanvirul Alam, Dipkamal Bhusal, Le Nguyen, Nidhi Rastogi
Cyber Threat Intelligence (CTI) is crucial for understanding and mitigating evolving cyber threats. Large Language Models (LLMs) have shown potential in this domain, but their reliability and accuracy remain concerns. Existing benchmarks evaluate LLMs on general capabilities and lack CTI-specific tasks. To address this gap, the authors introduce CTIBench, a benchmark designed to assess LLM performance in CTI applications. CTIBench comprises multiple datasets that probe LLMs' knowledge of the cyber-threat landscape, covering four fundamental cognitive capabilities: memorization, understanding, problem-solving, and reasoning. Its tasks include multiple-choice questions (CTI-MCQ), root cause mapping (CTI-RCM), vulnerability severity prediction (CTI-VSP), and threat actor attribution (CTI-TAA).

Evaluating several state-of-the-art models on these tasks reveals their strengths and weaknesses, contributing to a better understanding of LLM capabilities in CTI and suggesting areas for future research. Ultimately, LLM tooling of this kind aims to accelerate incident response by automating the triage and analysis of security alerts, reducing response time. The datasets and code are publicly available at <https://github.com/xashru/cti-bench>.
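To make the evaluation setup concrete, here is a minimal sketch of how one might score an LLM on the CTI-VSP (vulnerability severity prediction) task. The file name `cti_vsp.tsv`, its column names, and the `query_llm` helper are illustrative assumptions rather than the benchmark's actual interface; consult the repository for the released data format and evaluation scripts.

```python
"""Minimal sketch of scoring an LLM on a CTIBench-style CTI-VSP task.

Assumptions (not taken from the paper): the dataset is a TSV with
'Description' and 'GT' (ground-truth CVSS base score) columns, and
query_llm is a placeholder for whatever model API you use.
"""
import csv
import re


def query_llm(prompt: str) -> str:
    # Placeholder: substitute a real model call (OpenAI, HF pipeline, etc.).
    raise NotImplementedError


def extract_score(text: str) -> float | None:
    # Pull the first number in [0, 10] out of the model's free-text reply.
    m = re.search(r"\b(10(?:\.0)?|\d(?:\.\d)?)\b", text)
    return float(m.group(1)) if m else None


def evaluate(path: str = "cti_vsp.tsv") -> float:
    errors = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            prompt = (
                "Given the vulnerability description below, output only the "
                "CVSS v3.1 base score (0.0-10.0).\n\n" + row["Description"]
            )
            pred = extract_score(query_llm(prompt))
            if pred is not None:
                errors.append(abs(pred - float(row["GT"])))
    if not errors:
        raise ValueError("no parseable predictions")
    # Mean absolute deviation between predicted and ground-truth base
    # scores: lower is better for severity prediction.
    return sum(errors) / len(errors)
```

The other tasks follow the same prompt-then-score pattern with different metrics: CTI-MCQ and CTI-RCM reduce to exact-match accuracy against the gold option or CWE identifier, while CTI-TAA requires judging whether the attributed threat actor (or a known alias) matches the ground truth.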