Evaluation of LLM Chatbots for OSINT-based Cyber Threat Awareness

19 Apr 2024 | Samaneh Shafee*, Alysson Bessani, Pedro M. Ferreira
This study evaluates the performance of several Large Language Models (LLMs) as chatbots in the context of Open Source Intelligence (OSINT) for Cyber Threat Intelligence (CTI). The authors focus on binary classification and Named Entity Recognition (NER) tasks using a dataset from Twitter, comparing the performance of the ChatGPT, GPT4all, Dolly, Stanford Alpaca, Alpaca-LoRA, Falcon, and Vicuna chatbots against specialized models trained for these tasks. Key findings include:

- **Binary Classification**: ChatGPT achieved an F1 score of 0.94, while GPT4all scored 0.90. Other models, such as Dolly, Falcon, and Stanford Alpaca, underperformed.
- **Named Entity Recognition**: All evaluated chatbots showed limitations in NER, indicating a need for further improvement.
- **Evaluation Methodology**: The study uses a comprehensive Twitter dataset, with tweets labeled for cybersecurity relevance and named entities. Both tasks are evaluated with metrics such as F1 score, precision, and recall, as well as execution time (a sketch of the metric computation follows below).
- **Optimal Utilization Strategies**: The study explores prompt engineering and text length control to optimize chatbot performance. Prompt engineering refines prompts so that instructions are clear and precise, while text length control caps response lengths to reduce execution time (see the prompt sketch below).

The study highlights the potential of LLM-based chatbots in OSINT but also underscores the need for further development to match the performance of specialized models in NER tasks. The results provide insights for researchers and practitioners to improve chatbot technology and integrate it more effectively into CTI tools.
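To make the binary classification setup concrete, here is a minimal sketch of the prompt-engineering and text-length-control ideas described above, using the OpenAI Python SDK. The prompt wording, the `classify_tweet` helper, and the model choice are illustrative assumptions, not the paper's exact configuration:

```python
# Minimal sketch: prompt engineering + text length control for binary
# classification of tweets. All names below are illustrative assumptions,
# not the paper's verbatim prompts or settings.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = (
    "You are a cyber threat intelligence analyst. "
    "Answer with a single word, 'yes' or 'no': "
    "is the following tweet relevant to cybersecurity threats?\n\n"
    "Tweet: {tweet}"
)

def classify_tweet(tweet: str) -> str:
    """Classify one tweet, constraining the reply to a short yes/no answer."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # assumption: any chat-capable model works here
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(tweet=tweet)}],
        max_tokens=3,            # text length control: cap the response length
        temperature=0,           # deterministic output for reproducible evaluation
    )
    return response.choices[0].message.content.strip().lower()

print(classify_tweet("CVE-2024-1234 is being actively exploited in the wild."))
```

Constraining the answer format ("a single word") and capping `max_tokens` together address the execution-time concern the study raises: shorter responses are both cheaper to generate and trivial to parse into binary labels.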
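Once yes/no answers are mapped to binary labels, the reported metrics can be computed in the standard way. A minimal sketch with scikit-learn follows; the labels shown are made-up placeholders, not data from the study:

```python
# Minimal sketch of the evaluation metrics (precision, recall, F1) for the
# binary task. y_true/y_pred are placeholder values, not the paper's data.
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 1, 0, 1, 0, 0, 1, 0]  # gold labels: 1 = cybersecurity-relevant
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]  # chatbot yes/no answers mapped to 1/0

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```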