30 Apr 2024 | Constantinos Patsakis, Fran Casino, and Nikolaos Lykousas
This paper explores the use of large language models (LLMs) in deobfuscating malicious PowerShell scripts from the Emotet malware campaign. The study evaluates four LLMs—GPT-4, Gemini Pro, Code Llama, and Mixtral—to determine their effectiveness in extracting actionable intelligence from obfuscated code. The Emotet malware is known for its sophisticated obfuscation techniques, making it a suitable test case for this research.
The study finds that while LLMs are not yet perfect, they can efficiently deobfuscate malicious payloads. GPT-4 performed best, correctly identifying 69.56% of the URLs, followed by Gemini Pro at 36.84%. Code Llama and Mixtral performed poorly, with accuracy rates of 22.13% and 11.59%, respectively. When the task was simplified to extracting only the domain names rather than full URLs, all models improved, with GPT-4 gaining up to a further 19.16%.
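As a rough illustration of how such an accuracy comparison could be scored, the sketch below checks a model's extracted URLs against a ground-truth list at both URL and domain granularity. This is a hypothetical Python harness, not the evaluation code from the paper; the exact-match criterion, the `score` helper, and the placeholder URLs are all assumptions.

```python
from urllib.parse import urlparse

def domain_of(url: str) -> str:
    """Reduce a URL to its network location (host), lower-cased."""
    return urlparse(url).netloc.lower()

def score(extracted: list[str], ground_truth: list[str]) -> dict:
    """Exact-match accuracy for full URLs and for domains only (assumed criterion)."""
    truth_urls = set(ground_truth)
    truth_domains = {domain_of(u) for u in ground_truth}
    url_hits = sum(1 for u in extracted if u in truth_urls)
    domain_hits = sum(1 for u in extracted if domain_of(u) in truth_domains)
    n = max(len(ground_truth), 1)
    return {"url_accuracy": url_hits / n, "domain_accuracy": domain_hits / n}

# Placeholder values only; not data from the paper's dataset.
print(score(
    extracted=["http://203.0.113.7/payload.dll", "http://wrong.invalid/x"],
    ground_truth=["http://203.0.113.7/payload.dll", "http://198.51.100.2/b.dll"],
))
```

Scoring at domain granularity is more forgiving, since path-level differences introduced by the model no longer count as errors, which is consistent with the paper's observation that the simplified task is easier for all models.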
The study also highlights the challenges of using LLMs for deobfuscation, including hallucinations—where models generate incorrect or nonsensical outputs. These hallucinations are particularly prevalent in lower-performing models like Mixtral. Additionally, some models refused to perform the task, citing ethical concerns or the nature of the input.
The research suggests that LLMs can complement traditional deobfuscators by providing summaries of code and identifying MITRE ATT&CK techniques. This can help threat intelligence pipelines by automating the extraction of critical information from malicious scripts. However, the study also notes the need for further improvements in LLMs, including reducing hallucinations, supporting larger inputs, and enhancing training methodologies.
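To make the pipeline idea concrete, here is a minimal sketch of how a single analysis step might be automated, assuming the OpenAI Python SDK; the prompt wording, the `analyse` helper, and the model name are illustrative choices, not the prompts or tooling used in the study.

```python
from openai import OpenAI  # assumption: openai SDK installed and OPENAI_API_KEY set in the environment

client = OpenAI()

PROMPT = """You are assisting a malware analyst.
For the obfuscated PowerShell script below:
1. Deobfuscate it and list any URLs or domains it would contact.
2. Summarise in two sentences what the script does.
3. List the MITRE ATT&CK technique IDs that best describe its behaviour.

Script:
{script}
"""

def analyse(script: str, model: str = "gpt-4") -> str:
    """Send one deobfuscation/summary/ATT&CK request and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(script=script)}],
        temperature=0,  # deterministic output is preferable when extracting indicators
    )
    return response.choices[0].message.content
```

In a real pipeline, the returned text would still need validation against conventional tooling, since, as noted above, hallucinated URLs or technique IDs are a known failure mode.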
The study concludes that while LLMs are not yet capable of fully replacing traditional deobfuscators, they have significant potential to enhance malware analysis and threat intelligence pipelines. Future work should focus on improving the accuracy and reliability of LLMs in deobfuscation tasks, particularly for complex and obfuscated code. The research also emphasizes the importance of transparency and ethical considerations in AI development, especially in the context of cybersecurity.