RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent

23 Jul 2024 | Huiyu Xu, Wenhui Zhang, Zhibo Wang, Feng Xiao, Rui Zheng, Yunhe Feng, Zhongjie Ba, and Kui Ren
RedAgent is a novel red teaming system designed to generate context-aware jailbreak prompts for large language models (LLMs). It addresses the limitations of existing red teaming methods, which lack automation, scalability, and the ability to adapt to different contexts. RedAgent abstracts and models existing attacks under a "jailbreak strategy" framework, enabling it to generate more effective jailbreak prompts. By self-reflecting on contextual feedback and past red teaming trials stored in an additional memory buffer called Skill Memory, RedAgent continuously learns and refines its strategies to achieve more effective jailbreaks in specific contexts.

Extensive experiments show that RedAgent can jailbreak most black-box LLMs within five queries, doubling the efficiency of existing red teaming methods. It can also efficiently jailbreak customized LLM applications, discovering 60 severe vulnerabilities in real-world applications with only two queries per vulnerability. These findings have been reported to OpenAI and Meta for bug fixes, and the results indicate that LLM applications enhanced with external data or tools are more vulnerable to jailbreak attacks than foundation models.

RedAgent's contributions include a novel context-aware jailbreak prompt generation technique, an automated and efficient red teaming method, and a detailed analysis of vulnerabilities across different LLMs and scenarios. Its architecture comprises three main stages, Context-aware Profiling, Adaptive Jailbreak Planning, and Attacking and Reflection, which work together to generate context-aware jailbreak prompts and refine them based on feedback. The evaluation shows that RedAgent identifies vulnerabilities in a variety of LLMs and applications with a success rate exceeding 90% in most cases while requiring few queries. Its ability to adapt to different scenarios and continuously learn from past experiences makes it a powerful tool for red teaming LLMs.
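The three-stage loop described above can be sketched in code. This is a minimal illustrative sketch, not the authors' actual implementation: all class, function, and strategy names (SkillMemory, profile_context, the default strategy list, the toy success check) are assumptions introduced for illustration. It shows how profiling, memory-informed planning, and attack-plus-reflection could interact within the paper's five-query budget:

```python
# Hypothetical sketch of RedAgent's three stages: Context-aware Profiling,
# Adaptive Jailbreak Planning, and Attacking and Reflection, with a Skill
# Memory buffer. Names and strategies are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class SkillMemory:
    """Stores (strategy, context, outcome) triples from past red-teaming trials."""
    trials: list = field(default_factory=list)

    def record(self, strategy, context, success):
        self.trials.append((strategy, context, success))

    def rank_strategies(self, context):
        # Prefer strategies that already succeeded in this context; otherwise
        # fall back to untried default strategies for it.
        succeeded = [s for (s, c, ok) in self.trials if ok and c == context]
        failed = {s for (s, c, ok) in self.trials if not ok and c == context}
        defaults = ["role_play", "hypothetical_scenario", "payload_splitting"]
        return succeeded + [s for s in defaults if s not in failed] or defaults

def profile_context(target_description):
    """Stage 1: derive a coarse context label for the target LLM/application."""
    return "tool_augmented" if "tool" in target_description else "foundation"

def plan_jailbreak(memory, context):
    """Stage 2: pick a jailbreak strategy informed by Skill Memory."""
    return memory.rank_strategies(context)[0]

def attack_and_reflect(memory, context, strategy, query_llm, max_queries=5):
    """Stage 3: attack, reflect on feedback, update Skill Memory, re-plan."""
    for _ in range(max_queries):
        response = query_llm(strategy)
        success = "refuse" not in response        # toy stand-in for a judge
        memory.record(strategy, context, success)  # reflection step
        if success:
            return strategy, True
        strategy = plan_jailbreak(memory, context)  # re-plan from memory
    return strategy, False

# Usage with a stubbed target that only yields to one strategy:
memory = SkillMemory()
ctx = profile_context("a tool-augmented assistant")
stub = lambda s: "ok" if s == "hypothetical_scenario" else "I refuse"
strategy, ok = attack_and_reflect(memory, ctx, plan_jailbreak(memory, ctx), stub)
print(strategy, ok)
```

The loop re-plans after each failed trial, so later queries draw on what the Skill Memory has learned about the current context rather than repeating a failed strategy.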