21 May 2024 | Rebeka Tóth, Tamas Bisztray, László Erdődi
This study evaluates the security of web application code generated by Large Language Models (LLMs), specifically focusing on 2,500 GPT-4-generated PHP websites. The websites were deployed in Docker containers and tested for vulnerabilities using a hybrid approach of Burp Suite active scanning, static analysis, and manual review. The investigation identified vulnerabilities such as Insecure File Upload, SQL Injection, Stored XSS, and Reflected XSS in the generated PHP code. Key findings include:
- **Vulnerability Detection**: 2,440 vulnerable parameters were identified, with 11.56% of the sites potentially compromised by Burp's active scan.
- **Static Analysis**: Static analysis found that 26% of the sites had at least one vulnerability exploitable through web interaction.
- **File Upload Insecurity**: 78% of sites with file upload functionality were vulnerable.
- **SQL Injection**: 54.28% of sites using SQL queries lacked prepared statements, leaving them open to injection attacks.
- **Manual Code Audit**: 38% of randomly selected sites had vulnerable parameters, highlighting the limitations of automated tools.
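The static-analysis finding above rests on scanning generated PHP source for dangerous patterns. As a rough illustration (in Python, since this is only a sketch and not the paper's actual toolchain), a minimal check might flag lines where a raw request parameter flows directly into a query call; the function and pattern names here are hypothetical:

```python
import re

# Naive pattern for PHP lines where a raw request parameter ($_GET/$_POST/$_REQUEST)
# appears inside a query call before the call's closing parenthesis. A real static
# analyzer tracks data flow; this regex only illustrates the idea.
SQLI_PATTERN = re.compile(
    r'(?:mysqli_query|->query)\s*\([^)]*\$_(?:GET|POST|REQUEST)\b'
)

def flag_sql_injection(php_source: str) -> list[str]:
    """Return source lines that pass raw request input into a query call."""
    return [
        line.strip()
        for line in php_source.splitlines()
        if SQLI_PATTERN.search(line)
    ]

vulnerable = '$r = mysqli_query($c, "SELECT * FROM users WHERE id=" . $_GET["id"]);'
safe = '$stmt = $conn->prepare("SELECT * FROM users WHERE id = ?");'
```

A check this shallow misses anything indirect (input copied into a variable first), which is consistent with the paper's observation that automated tools alone are insufficient and manual audit still uncovers additional vulnerable parameters.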
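The prepared-statement finding can be made concrete with a small demonstration. The sketch below is in Python's `sqlite3` rather than PHP, but it mirrors the pattern the study flags: concatenating user input into the SQL string versus binding it as a parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

user_input = "1 OR 1=1"  # classic injection payload

# Vulnerable: the payload becomes part of the query text,
# so the WHERE clause is always true and every row matches.
unsafe = conn.execute(
    f"SELECT name FROM users WHERE id = {user_input}"
).fetchall()

# Safe: the payload is bound as a single value via a placeholder,
# so it is compared literally and matches nothing.
safe = conn.execute(
    "SELECT name FROM users WHERE id = ?", (user_input,)
).fetchall()
```

The parameterized form is the prepared-statement discipline that 54.28% of the SQL-using generated sites lacked.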
The study emphasizes the need for thorough testing and evaluation when using generative AI technologies in software development. The dataset and vulnerability labels are publicly available on GitHub to support further research. The authors conclude that GPT-4 is prone to generating PHP code with significant security flaws, and that the generated code often lacks the complexity and realism needed for real-world deployment. Future research directions include expanding the dataset, comparing different LLM models, and enhancing vulnerability scanning methods.