This paper explores the use of Large Language Models (LLMs) for detecting vulnerabilities in Android applications. Despite advancements in secure system development, Android apps still contain numerous vulnerabilities, necessitating effective detection methods. Traditional static and dynamic analysis tools have limitations, such as high false positive rates and limited scope. Machine learning approaches have been explored, but their real-world applicability is constrained by data requirements and feature engineering challenges. LLMs, with their vast parameter counts, show great potential in understanding both natural and programming languages. The authors investigate the effectiveness of LLMs in detecting Android vulnerabilities and build an AI-driven workflow to help developers identify and fix them. Their experiments show that LLMs outperform expectations, correctly flagging insecure apps in 91.67% of cases on the Ghera benchmark. They also explore how different configurations affect True Positive (TP) and False Positive (FP) rates.
The study focuses on prompt engineering and retrieval-augmented generation (RAG) to enhance LLM performance. Prompt engineering involves designing prompts that steer LLMs toward a specific task, while RAG lets LLMs draw on external knowledge to improve accuracy. The authors use the Ghera benchmark, which contains applications with known vulnerabilities, to evaluate their approach. They find that providing summaries of known vulnerabilities improves detection accuracy and reduces inference time. However, the model can sometimes misclassify code due to overcautiousness or a lack of context.
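As a rough illustration of how such a pipeline might be wired together (this is a minimal sketch, not the paper's actual implementation), the snippet below retrieves short vulnerability summaries by simple keyword matching and prepends them to the prompt before asking the model for a verdict. The summaries, the retrieval heuristic, the prompt wording, and the model name are all illustrative assumptions.

```python
# Minimal sketch of prompt engineering + simple RAG for Android vulnerability
# triage. Not the paper's pipeline; summaries, prompts, and model are assumed.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Tiny stand-in "knowledge base": summaries of known Android weakness patterns.
VULN_SUMMARIES = {
    "WebView": "Enabling JavaScript or file access in WebView can expose the app to XSS and local file theft.",
    "TrustManager": "TrustManager/HostnameVerifier implementations that accept all certificates allow MITM attacks.",
    "MODE_WORLD_READABLE": "World-readable/writable storage leaks app data to other installed apps.",
}

def retrieve_context(source_code: str) -> str:
    """Naive retrieval: include summaries whose trigger keyword appears in the code."""
    hits = [summary for keyword, summary in VULN_SUMMARIES.items() if keyword in source_code]
    return "\n".join(hits) if hits else "No matching vulnerability summaries found."

def classify(source_code: str, model: str = "gpt-4o-mini") -> str:
    """Ask the LLM whether the snippet is vulnerable, grounding it with retrieved summaries."""
    context = retrieve_context(source_code)
    messages = [
        {"role": "system",
         "content": "You are an Android security reviewer. Answer 'VULNERABLE' or 'SECURE' and explain briefly."},
        {"role": "user",
         "content": f"Known vulnerability summaries:\n{context}\n\nCode under review:\n{source_code}"},
    ]
    response = client.chat.completions.create(model=model, messages=messages, temperature=0)
    return response.choices[0].message.content

if __name__ == "__main__":
    snippet = "webView.getSettings().setJavaScriptEnabled(true);"
    print(classify(snippet))
```

The retrieved summaries play the role of the external context the authors describe: without them, the model must rely on its own recall, which is where overcautious or under-informed misclassifications tend to appear.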
The authors develop a Python package called "LLB" that can be used to scan Android applications for security vulnerabilities. The package includes a command-line interface and supports multiple scanners, including ones targeting the Ghera and Vuldroid benchmarks. The LLB package successfully identifies 6 out of 8 vulnerabilities in the Vuldroid case study. The results show that LLMs can be effective in detecting Android vulnerabilities, but they require careful prompt engineering and context to avoid false positives.
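To give a concrete sense of what a command-line scanner in this spirit might look like, here is a short sketch that walks an Android project and runs each source file through an LLM-backed classifier. It does not reproduce the actual LLB interface; the module name `llb_sketch`, the `classify()` helper (from the previous sketch), and the argument names are assumptions for illustration only.

```python
# Illustrative sketch of a CLI scanner loop, not the real LLB package.
import argparse
from pathlib import Path

from llb_sketch import classify  # the RAG-backed classifier from the previous sketch (hypothetical module name)

def scan_app(app_dir: Path, extensions=(".java", ".kt", ".xml")) -> dict[str, str]:
    """Walk an Android project tree and ask the classifier about each source file."""
    results = {}
    for path in app_dir.rglob("*"):
        if path.is_file() and path.suffix in extensions:
            verdict = classify(path.read_text(encoding="utf-8", errors="ignore"))
            results[str(path)] = verdict
    return results

def main() -> None:
    parser = argparse.ArgumentParser(description="LLM-based Android vulnerability scan (sketch).")
    parser.add_argument("app_dir", type=Path, help="Path to the Android project to scan")
    args = parser.parse_args()
    for file_path, verdict in scan_app(args.app_dir).items():
        print(f"{file_path}: {verdict}")

if __name__ == "__main__":
    main()
```

In practice, a per-file loop like this is where prompt design and supplied context matter most: files sent with no surrounding project context are exactly the cases the authors report as prone to false positives.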
The study highlights the potential of LLMs in software engineering, but also acknowledges the challenges, such as the need for fine-tuning and the risk of bias in prompts. The authors suggest that future work should focus on improving the accuracy of LLMs by incorporating more context and refining the analysis pipeline. The study also emphasizes the importance of combining LLMs with static analysis to enhance the effectiveness of vulnerability detection in Android applications.