Assessing LLMs in Malicious Code Deobfuscation of Real-world Malware Campaigns

30 Apr 2024 | Constantinos Patsakis, Fran Casino, Nikolaos Lykousas
The paper explores the capabilities of large language models (LLMs) in deobfuscating malicious code, focusing on real-world malware campaigns. The authors evaluate four state-of-the-art LLMs (GPT-4, Gemini Pro, Code Llama, and Mixtral) on a dataset of 2,000 obfuscated PowerShell scripts from the Emotet malware campaign. The goal is to extract actionable intelligence from the obfuscated code, such as URLs and command-and-control servers.

The results indicate that while the best-performing LLM, GPT-4, correctly identified 69.56% of the URLs, the local LLMs performed significantly worse, with Code Llama reaching only 22.13% accuracy and Mixtral 11.59%. The study also highlights the prevalence of hallucinations, where the LLMs generated incorrect or irrelevant outputs.

Despite these limitations, the authors suggest that LLMs can be integrated into existing cyber threat intelligence pipelines to complement traditional deobfuscators, particularly when malware authors frequently change their code and tooling. The paper concludes by discussing the potential of LLMs to improve malware analysis and the need for further research on issues such as hallucinations, input size limits, and training methodologies. The authors propose a pipeline that combines LLMs with traditional deobfuscators to enhance the accuracy and effectiveness of threat intelligence.
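The paper does not publish the evaluation code itself, but the described workflow (feed an obfuscated PowerShell dropper to a deobfuscator, fall back to an LLM, extract the URLs it reports, and compare them against known indicators) can be sketched roughly as follows. This is a minimal illustration, assuming the openai>=1.0 Python client and an OPENAI_API_KEY in the environment; the prompt wording, regexes, and function names are hypothetical and are not the authors' actual prompts or tooling.

```python
import base64
import re

from openai import OpenAI  # assumes the openai>=1.0 Python client is installed

# Rough URL matcher; Emotet droppers typically embed several payload URLs.
URL_RE = re.compile(r"https?://[^\s'\"<>()\[\]]+", re.IGNORECASE)

# Hypothetical prompt, not the authors' actual wording.
PROMPT = (
    "The following PowerShell script is obfuscated. Deobfuscate it and list only "
    "the URLs it would contact, one per line, with no commentary:\n\n{script}"
)


def extract_urls_traditional(script: str) -> set[str]:
    """Cheap rule-based pass: scan the raw script and any Base64 blobs for URLs.
    A real deobfuscator would instead emulate or rewrite the script."""
    candidates = set(URL_RE.findall(script))
    for blob in re.findall(r"[A-Za-z0-9+/]{40,}={0,2}", script):
        try:  # PowerShell -EncodedCommand payloads are UTF-16LE Base64
            decoded = base64.b64decode(blob).decode("utf-16-le", errors="ignore")
            candidates |= set(URL_RE.findall(decoded))
        except Exception:
            continue
    return candidates


def extract_urls_llm(script: str, model: str = "gpt-4") -> set[str]:
    """Ask an LLM to deobfuscate the script and return whatever URLs it reports."""
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": PROMPT.format(script=script)}],
    )
    return set(URL_RE.findall(response.choices[0].message.content or ""))


def triage(script: str, known_iocs: set[str]) -> dict:
    """Combined pipeline: try the rule-based pass first, fall back to the LLM,
    then score the output against known indicators to expose hallucinated URLs."""
    urls = extract_urls_traditional(script) or extract_urls_llm(script)
    return {
        "extracted": urls,
        "confirmed": urls & known_iocs,
        "hallucinated": urls - known_iocs,  # reported by the model but not in the ground truth
        "missed": known_iocs - urls,
    }
```

Running the cheap rule-based pass first and reserving the LLM for samples it cannot handle mirrors the paper's suggestion that LLMs complement rather than replace traditional deobfuscators, and checking the output against known indicators is one pragmatic way to contain the hallucination problem the authors report.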