Evaluation of LLM Chatbots for OSINT-based Cyber Threat Awareness

19 Apr 2024 | Samaneh Shafee, Alysson Bessani, Pedro M. Ferreira
This paper evaluates the performance of several large language model (LLM) chatbots on binary classification and Named Entity Recognition (NER) tasks for Open-Source Intelligence (OSINT)-based Cyber Threat Intelligence (CTI). The study compares commercial and open-source chatbots, including ChatGPT, GPT4all, Dolly, Stanford Alpaca, Alpaca-LoRA, Falcon, and Vicuna, using a publicly available annotated Twitter dataset of 31,281 tweets, each labeled for cybersecurity relevance and for named entities. The chatbots are assessed on their ability to classify tweets as relevant to cybersecurity and to extract named entities such as organizations and product versions.

In binary classification, ChatGPT achieved an F1 score of 0.94, while GPT4all achieved 0.90. In NER, however, all chatbots showed limitations and were less effective than specialized models. The study therefore highlights the potential of LLM-based chatbots for OSINT-based CTI, but emphasizes that their NER performance must improve before they can replace specialized models. It also examines how different chatbot utilization methods affect performance, and the importance of prompt engineering and text length control in optimizing results. The findings clarify the capabilities and limitations of LLM-based chatbots in cybersecurity applications and provide insights for improving chatbot technology to enhance OSINT-based CTI tools.
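As a rough illustration of the binary-classification setup described above, the sketch below builds a relevance prompt for a tweet, sends it to a chatbot through a hypothetical query_chatbot helper (a stand-in for whichever commercial or local model is being evaluated), and scores the predictions with the standard F1 metric. The prompt wording and the helper are assumptions for illustration, not the authors' exact protocol.

```python
# Minimal sketch of OSINT tweet classification with an LLM chatbot.
# `query_chatbot` is a hypothetical placeholder for whatever model is
# being evaluated (ChatGPT, GPT4all, Vicuna, ...); it is not a real API.
from sklearn.metrics import f1_score

PROMPT = (
    "Answer with a single word, 'yes' or 'no': is the following tweet "
    "relevant to cybersecurity threats?\n\nTweet: {tweet}"
)

def classify_tweet(tweet: str, query_chatbot) -> int:
    """Return 1 if the chatbot deems the tweet cybersecurity-relevant."""
    reply = query_chatbot(PROMPT.format(tweet=tweet))
    return 1 if reply.strip().lower().startswith("yes") else 0

def evaluate(tweets, labels, query_chatbot) -> float:
    """Compute the F1 score of the chatbot's binary predictions."""
    preds = [classify_tweet(t, query_chatbot) for t in tweets]
    return f1_score(labels, preds)
```

An analogous prompt asking the model to list organization and product-version entities, with the reply parsed into spans, would cover the NER task; the paper reports that this is where current chatbots still fall short of specialized models.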