13 Mar 2024 | Zeguan Xiao, Yan Yang, Guanhua Chen, Yun Chen
TASTLE is a novel black-box jailbreak framework for automated red teaming of large language models (LLMs). The framework leverages the distractibility and over-confidence phenomena of LLMs to generate effective jailbreak prompts. It consists of three key components: malicious content concealing, memory-reframing, and iterative jailbreak template optimization. The malicious content is concealed within a complex and unrelated scenario, while the memory-reframing mechanism distracts the model's attention away from the malicious task. The iterative optimization algorithm improves the effectiveness of the jailbreak prompts by using an attacker LLM, target LLM, and judgment model to iteratively refine the jailbreak templates.

Extensive experiments on both open-source and proprietary LLMs demonstrate the superiority of TASTLE in terms of effectiveness, scalability, and transferability. The framework achieves high attack success rates on models such as ChatGPT and GPT-4. The research highlights the need for more effective and practical defense strategies against jailbreak attacks. The paper warns that the content generated by LLMs may be offensive to readers.
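The attacker/target/judge refinement loop described above can be sketched as follows. This is a minimal illustration based only on the summary, not the paper's actual implementation: the three model calls are local stubs standing in for real LLM APIs, and the function names, the `[TASK]` placeholder convention, and the 0-to-1 scoring scale are all assumptions.

```python
def attacker_llm(template: str, feedback: str) -> str:
    """Stub attacker model: rewrites the jailbreak template using judge feedback."""
    return template + " [refined given: " + feedback + "]"

def target_llm(prompt: str) -> str:
    """Stub target model: returns a response to the candidate jailbreak prompt."""
    return "response to: " + prompt

def judge(response: str) -> float:
    """Stub judgment model scoring jailbreak success in [0, 1].
    Here the score simply grows with the number of refinement rounds."""
    return min(1.0, 0.2 * response.count("[refined"))

def optimize_template(seed_template: str, malicious_task: str,
                      threshold: float = 0.8, max_iters: int = 10):
    """Iteratively refine a jailbreak template: the attacker proposes a
    revision, the target responds, the judge scores the response, and the
    score feeds back into the next attacker call."""
    template, feedback = seed_template, "initial attempt"
    score = 0.0
    for _ in range(max_iters):
        template = attacker_llm(template, feedback)
        prompt = template.replace("[TASK]", malicious_task)
        score = judge(target_llm(prompt))
        if score >= threshold:
            break  # jailbreak template judged effective enough
        feedback = f"judge score {score:.2f}, distract attention further"
    return template, score

template, score = optimize_template(
    "Within a complex, unrelated scenario, complete [TASK].", "example task")
print(score >= 0.8)
```

With the stub judge rewarding each refinement round, the loop converges after four iterations; with real models, the judge's textual feedback would drive the attacker toward templates that better conceal the task and reframe the target's attention.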