[slides and audio] Defending Jailbreak Prompts via In-Context Adversarial Game

The paper introduces the In-Context Adversarial Game (ICAG) to defend against jailbreak attacks on Large Language Models (LLMs) without requiring fine-tuning. ICAG leverages agent learning to conduct an adversarial game, dynamically extending knowledge to enhance both attack and defense agents. Unlike traditional methods that rely on static datasets, ICAG employs an iterative process to continuously improve defenses against new jailbreak prompts. Empirical studies show that ICAG significantly reduces jailbreak success rates across various attack scenarios and demonstrates excellent transferability to other LLMs. The method involves an iterative refinement of attack prompts and enhancements to safety instructions, fostering a dynamic adversarial game where both attack and defense capabilities intensify with each cycle. The paper also discusses the limitations and ethical considerations of the approach, emphasizing the need for ongoing ethical review and collaboration to ensure responsible use.The paper introduces the In-Context Adversarial Game (ICAG) to defend against jailbreak attacks on Large Language Models (LLMs) without requiring fine-tuning. ICAG leverages agent learning to conduct an adversarial game, dynamically extending knowledge to enhance both attack and defense agents. Unlike traditional methods that rely on static datasets, ICAG employs an iterative process to continuously improve defenses against new jailbreak prompts. Empirical studies show that ICAG significantly reduces jailbreak success rates across various attack scenarios and demonstrates excellent transferability to other LLMs. The method involves an iterative refinement of attack prompts and enhancements to safety instructions, fostering a dynamic adversarial game where both attack and defense capabilities intensify with each cycle. The paper also discusses the limitations and ethical considerations of the approach, emphasizing the need for ongoing ethical review and collaboration to ensure responsible use.

Defending Jailbreak Prompts via In-Context Adversarial Game

5 Jul 2024 | Yujun Zhou1, Yufei Han2, Haomin Zhuang1, Kehan Guo1, Zhenwen Liang1, Hongyan Bao3, and Xiangliang Zhang 2,3