Defending Jailbreak Prompts via In-Context Adversarial Game

Defending Jailbreak Prompts via In-Context Adversarial Game

5 Jul 2024 | Yujun Zhou¹, Yufei Han², Haomin Zhuang¹, Kehan Guo¹, Zhenwen Liang¹, Hongyan Bao³, and Xiangliang Zhang ¹
This paper introduces the In-Context Adversarial Game (ICAG) as a novel framework for defending against jailbreak attacks on Large Language Models (LLMs) without requiring fine-tuning. ICAG leverages agent learning to conduct an adversarial game between an attack agent and a defense agent, aiming to dynamically enhance the model's ability to resist jailbreak prompts. Unlike traditional methods that rely on static datasets, ICAG employs an iterative process to improve both the attack and defense agents, enabling continuous adaptation to new jailbreak prompts. The defense agent generates system prompts to counter jailbreak attempts by reflecting on both successful and failed attempts, while the attack agent refines jailbreak prompts by analyzing successful and failed attempts to extract insights. This adversarial dynamic game enhances both attack and defense capabilities over time. The paper evaluates ICAG's effectiveness through experiments on various datasets, including AdvBench and Self Reminder Data (SRD). Results show that ICAG significantly reduces the Jailbreak Success Rate (JSR) compared to baseline methods, with an average reduction of 7.99%. Additionally, ICAG demonstrates strong transferability across different LLMs, with only a slight increase in JSR when applied to other models. The method is also effective in reducing over-defensiveness, as evidenced by lower over-defense rates on the Xstest dataset. The paper also discusses the limitations of ICAG, including its reliance on a relatively static adversary model and the quality of the initial prompt set. Future work could address these limitations by exploring more scalable strategies and extending the framework to multimodal contexts. Ethical considerations are also raised, emphasizing the importance of responsible development and use of LLMs to prevent their misuse. Overall, ICAG provides a promising approach to defending against jailbreak attacks without the need for extensive retraining, offering a versatile and adaptable defense mechanism for LLMs.This paper introduces the In-Context Adversarial Game (ICAG) as a novel framework for defending against jailbreak attacks on Large Language Models (LLMs) without requiring fine-tuning. ICAG leverages agent learning to conduct an adversarial game between an attack agent and a defense agent, aiming to dynamically enhance the model's ability to resist jailbreak prompts. Unlike traditional methods that rely on static datasets, ICAG employs an iterative process to improve both the attack and defense agents, enabling continuous adaptation to new jailbreak prompts. The defense agent generates system prompts to counter jailbreak attempts by reflecting on both successful and failed attempts, while the attack agent refines jailbreak prompts by analyzing successful and failed attempts to extract insights. This adversarial dynamic game enhances both attack and defense capabilities over time. The paper evaluates ICAG's effectiveness through experiments on various datasets, including AdvBench and Self Reminder Data (SRD). Results show that ICAG significantly reduces the Jailbreak Success Rate (JSR) compared to baseline methods, with an average reduction of 7.99%. Additionally, ICAG demonstrates strong transferability across different LLMs, with only a slight increase in JSR when applied to other models. The method is also effective in reducing over-defensiveness, as evidenced by lower over-defense rates on the Xstest dataset. The paper also discusses the limitations of ICAG, including its reliance on a relatively static adversary model and the quality of the initial prompt set. Future work could address these limitations by exploring more scalable strategies and extending the framework to multimodal contexts. Ethical considerations are also raised, emphasizing the importance of responsible development and use of LLMs to prevent their misuse. Overall, ICAG provides a promising approach to defending against jailbreak attacks without the need for extensive retraining, offering a versatile and adaptable defense mechanism for LLMs.
Reach us at info@study.space
[slides and audio] Defending Jailbreak Prompts via In-Context Adversarial Game