RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent

23 Jul 2024 | Huiyu Xu, Wenhui Zhang, Zhibo Wang, Feng Xiao, Rui Zheng, Yunhe Feng, Zhongjie Ba, and Kui Ren
Huiyu Xu, Wenhui Zhang, Zhibo Wang, Feng Xiao, Rui Zheng, Yunhe Feng, Zhongjie Ba, Kui Ren

*The State Key Laboratory of Blockchain and Data Security, Zhejiang University, P. R. China*
*School of Cyber Science and Technology, Zhejiang University, P. R. China*
*Palo Alto Networks, USA*
*Department of Computer Science and Engineering, University of North Texas, USA*

{huiyuxu, zhibowang, zr_12f, zhongjieba, kuren}@zju.edu.cn, wenhuizhang1222@gmail.com, fxiao@paloaltonetworks.com, Yunhe.Feng@unt.edu

**Abstract**

Advanced Large Language Models (LLMs) such as GPT-4 have been integrated into many real-world applications, expanding their attack surface and exposing them to threats such as jailbreak attacks. Existing red teaming methods struggle to adapt to different scenarios and are inefficient at generating context-aware jailbreak prompts. To address these issues, we propose RedAgent, a multi-agent LLM system that leverages "jailbreak strategies" to generate context-aware prompts. RedAgent continuously learns from contextual feedback and repeated trials, improving its effectiveness in specific contexts. Extensive experiments show that RedAgent can jailbreak most black-box LLMs with only five queries, a twofold efficiency improvement over existing methods. Additionally, RedAgent can efficiently discover severe vulnerabilities in customized LLM applications, demonstrating the system's robustness and adaptability. Our findings highlight the importance of context awareness and automation in red teaming methods, providing valuable insights for enhancing the security of LLM-based applications.
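To make the abstract's core idea concrete, the sketch below illustrates the kind of trial-and-feedback loop described: a planner selects a jailbreak strategy, a prompt is crafted under that strategy, the target's response is evaluated, and per-strategy success statistics are updated so later attempts favor what worked. This is a minimal illustrative sketch, not the paper's actual implementation; the strategy names, the `red_team` function, and the scoring rule are all hypothetical, and the real system uses LLM agents rather than string templates.

```python
# Illustrative sketch of a context-aware red-teaming loop with strategy memory.
# All names here (STRATEGIES, StrategyMemory, red_team) are hypothetical.

STRATEGIES = ["role_play", "hypothetical_scenario", "payload_obfuscation"]

class StrategyMemory:
    """Tracks per-strategy trial outcomes so the planner can learn from feedback."""
    def __init__(self):
        self.stats = {s: {"wins": 0, "tries": 0} for s in STRATEGIES}

    def best(self):
        # Optimistically score untried strategies at 1.0, otherwise use the
        # empirical success rate observed so far.
        def score(s):
            st = self.stats[s]
            return st["wins"] / st["tries"] if st["tries"] else 1.0
        return max(STRATEGIES, key=score)

    def record(self, strategy, success):
        self.stats[strategy]["tries"] += 1
        if success:
            self.stats[strategy]["wins"] += 1

def red_team(goal, query_target, evaluate, budget=5):
    """Iteratively craft prompts until the evaluator flags a jailbreak
    or the query budget (five queries in the paper's headline result) runs out."""
    memory = StrategyMemory()
    for _ in range(budget):
        strategy = memory.best()
        # Placeholder: the real system would have an LLM agent rewrite the
        # goal according to the chosen strategy and the target's context.
        prompt = f"[{strategy}] {goal}"
        response = query_target(prompt)
        success = evaluate(response)
        memory.record(strategy, success)
        if success:
            return prompt, response
    return None, None
```

A usage example with a stub target that only yields to one strategy:

```python
target = lambda p: "UNSAFE" if "role_play" in p else "I cannot help with that."
prompt, response = red_team("test goal", target, lambda r: r == "UNSAFE")
```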