Comprehensive evaluation of Mal-API-2019 dataset by machine learning in malware detection

Year 2024 | Zhenglin Li, Haibei Zhu, Houze Liu, Jintong Song, Qishuo Cheng
This study evaluates how well several machine learning models detect malware on the Mal-API-2019 dataset. It compares ensemble methods (Random Forest and XGBoost) with non-ensemble methods (K-Nearest Neighbors (KNN) and Neural Networks).

The Mal-API-2019 dataset, derived from Cuckoo Sandbox, contains eight malware categories and focuses on Windows API calls. Preprocessing uses frequency and temporal embeddings of the API calls, together with TF-IDF and Principal Component Analysis (PCA) for feature representation, and the study emphasizes that these steps matter for model performance.

Ensemble methods achieve higher accuracy, precision, and recall than the other models. XGBoost slightly outperforms Random Forest in precision, while the two show similar recall. KNN and Neural Networks perform worse, which indicates difficulty in handling complex malware data.

The study concludes that ensemble methods are effective for malware detection, but that non-ensemble models still offer value, especially in interpretability and computational efficiency. The paper also discusses the limitations of current approaches and the need for continuous adaptation to evolving malware threats. For future work, it suggests expanding datasets and integrating deep learning techniques such as LSTM for more accurate detection.
The study contributes to ongoing discussions in cybersecurity by providing practical insights for developing robust malware detection systems.
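
To make the TF-IDF step concrete, here is a minimal sketch of how API call traces could be turned into weighted feature vectors. It is an illustration under stated assumptions, not the paper's pipeline: the three traces are invented toy data, the function name `tfidf_vectors` is hypothetical, and the add-one smoothed IDF with L2 normalization is one common formulation that the summary does not confirm the authors used.

```python
import math
from collections import Counter


def tfidf_vectors(sequences):
    """Return (vocabulary, vectors) of L2-normalized TF-IDF weights.

    Each sequence is a whitespace-separated string of API call names.
    Assumption: add-one smoothed IDF, idf = ln((1 + N) / (1 + df)) + 1.
    """
    docs = [seq.split() for seq in sequences]
    vocab = sorted({call for doc in docs for call in doc})
    n_docs = len(docs)

    # Document frequency: number of traces that contain each API call.
    df = Counter(call for doc in docs for call in set(doc))
    idf = {call: math.log((1 + n_docs) / (1 + df[call])) + 1.0 for call in vocab}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against an empty trace
        raw = [counts[call] / total * idf[call] for call in vocab]
        # L2-normalize so traces of different lengths are comparable.
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors


# Toy traces (invented for illustration, not taken from Mal-API-2019).
traces = [
    "ldrloaddll ntcreatefile ntwritefile ntclose",
    "ldrloaddll regopenkeyexa regsetvalueexa ntclose",
    "ldrloaddll ntcreatefile ntcreatefile ntwritefile",
]

vocab, vectors = tfidf_vectors(traces)
for trace, vec in zip(traces, vectors):
    print(trace)
    print({call: round(w, 3) for call, w in zip(vocab, vec) if w > 0})
```

Calls that appear in every trace, such as `ldrloaddll` here, receive the lowest IDF, so rarer calls dominate each vector. In a fuller pipeline these vectors would typically be reduced with PCA before being passed to a classifier such as Random Forest or XGBoost.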