Hybrid Deep Learning Approach Based on LSTM and CNN for Malware Detection

27 June 2024 | Preeti Thakur¹ · Vineet Kansal² · Vinay Rishiwal³
This paper presents a hybrid deep learning approach that combines Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs) for malware detection. The method converts malware binaries into grayscale images and analyzes them with a CNN-LSTM network: the CNN extracts features in parallel, while the LSTM captures temporal dependencies in the binary code input. Dynamic features are extracted, ranked, and reduced using Principal Component Analysis (PCA), which lowers computational time. Several classifiers are then applied, and the final malware family classification is decided by a voting scheme.

The method was evaluated on a public malware dataset and on network traffic, where it is reported to reach state-of-the-art performance in identifying various malware families. It reduces the resources required for manual analysis and improves system security. The approach achieved high precision, recall, accuracy, and F1 score, outperforming existing methods. Future research directions include improving feature extraction techniques and developing real-time detection models.

Keywords: Malware, CNN, LSTM, Hybrid model, Image analysis, Machine learning

Malware is a program designed to harm or exploit computer systems or networks. It ranges from annoying adware to destructive viruses and worms that can cause significant damage to both personal and enterprise systems. The increasing sophistication of malware attacks requires advanced techniques to analyze and understand the nature of the threat. According to a report by AV-Test, over 350,000 new malware samples are discovered daily, which highlights the need for effective and efficient analysis techniques that keep up with the constantly evolving threat landscape. Malware analysis examines malicious software to identify its behavior, characteristics, and potential impact.
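The binary-to-image step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common convention of mapping each byte to one pixel intensity (0 to 255) and filling a fixed-width image row by row; the function name `bytes_to_grayscale`, the width of 16, and the zero padding of the last row are illustrative assumptions, not details taken from the paper.

```python
import math


def bytes_to_grayscale(data: bytes, width: int = 16) -> list[list[int]]:
    """Map each byte (0-255) to one pixel intensity, filling rows left to right."""
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    # Zero-pad the tail so the last row is complete (an assumption of this sketch).
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


# Stand-in for the raw contents of a binary; a real sample would be read from disk.
sample = bytes(range(40))
image = bytes_to_grayscale(sample, width=16)
print(len(image), "rows x", len(image[0]), "columns")
```

In practice the resulting rows would be converted to an array or image tensor before being passed to the CNN; the width is a tunable choice that is often tied to file size.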
Analysts use several techniques to analyze malware, including static, dynamic, and hybrid analysis. Detection solutions based on deep learning, cloud-based, network-based, and graph-based models are also available. Hybrid analysis combines static and dynamic methods to provide a comprehensive understanding of the malware. Static analysis investigates the structure and code of the malware without executing it, using methods such as disassembly, decompilation, and code analysis.

A deep neural network trained in two phases serves as the foundation for the static analysis subsystem: an unsupervised pre-training phase using stacked denoising autoencoders, followed by supervised fine-tuning through backpropagation. Features are extracted from the Portable Executable (PE) file's Disk Operating System (DOS) Header, File Header, Optional Header, and Section Table. Simple numeric fields are treated as unsigned integers to produce a compact description of the field information. Other input types, such as timestamps, arrays, and text, are handled with a hashing algorithm. Field offsets are kept to preserve the spatial information associated with each element.
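The header-based feature encoding described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the selection of fields, the SHA-256 hash, the bucket count `HASH_BUCKETS`, and the helper names are all hypothetical. The byte offsets and field layout follow the standard PE format (DOS header `e_magic` at offset 0, `e_lfanew` at offset 0x3C, and the COFF File Header immediately after the 4-byte `PE\0\0` signature).

```python
import hashlib
import struct

HASH_BUCKETS = 1024  # hypothetical bucket count, not taken from the paper


def hash_bucket(value: bytes) -> int:
    """Map an arbitrary byte string to a stable bucket index."""
    digest = hashlib.sha256(value).digest()
    return int.from_bytes(digest[:4], "little") % HASH_BUCKETS


def extract_header_features(data: bytes) -> list[tuple[int, str, int]]:
    """Return (offset, field name, encoded value) triples for a few PE fields."""
    (e_magic,) = struct.unpack_from("<H", data, 0)      # "MZ" signature
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)  # offset of PE signature
    if e_magic != 0x5A4D or data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("not a PE file")
    coff = e_lfanew + 4  # the File Header starts right after the signature
    machine, n_sections, timestamp = struct.unpack_from("<HHI", data, coff)
    return [
        # Simple numeric fields are kept as unsigned integers.
        (0, "e_magic", e_magic),
        (0x3C, "e_lfanew", e_lfanew),
        (coff, "Machine", machine),
        (coff + 2, "NumberOfSections", n_sections),
        # Non-numeric-like inputs such as timestamps are hashed into a bucket.
        (coff + 4, "TimeDateStamp", hash_bucket(struct.pack("<I", timestamp))),
    ]


def make_demo_pe() -> bytes:
    """Build a tiny synthetic header so the sketch runs without a real binary."""
    dos = bytearray(64)
    struct.pack_into("<H", dos, 0, 0x5A4D)
    struct.pack_into("<I", dos, 0x3C, 64)
    coff = struct.pack("<HHIIIHH", 0x8664, 3, 1_700_000_000, 0, 0, 0, 0)
    return bytes(dos) + b"PE\0\0" + coff


features = extract_header_features(make_demo_pe())
for offset, name, value in features:
    print(f"{offset:#06x} {name:<18} {value}")
```

Keeping the offset alongside each encoded value is one simple way to retain the positional information the text refers to; a fuller version would also walk the Optional Header and Section Table.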