13 Mar 2024 | Zeguan Xiao, Yan Yang, Guanhua Chen, Yun Chen
TASTLE is a novel black-box jailbreak framework for automated red teaming of large language models (LLMs). The framework leverages the distractibility and over-confidence phenomena of LLMs to generate effective jailbreak prompts. It consists of three key components: malicious content concealing, memory-reframing, and iterative jailbreak template optimization. The malicious content is concealed within a complex and unrelated scenario, while the memory-reframing mechanism distracts the model's attention away from the malicious task. The iterative optimization algorithm improves the effectiveness of the jailbreak prompts by using an attacker LLM, target LLM, and judgment model to iteratively refine the jailbreak templates.

Extensive experiments on both open-source and proprietary LLMs demonstrate the superiority of TASTLE in terms of effectiveness, scalability, and transferability. The framework achieves high attack success rates on models such as ChatGPT and GPT-4. The research highlights the need for more effective and practical defense strategies against jailbreak attacks. The paper warns that the content generated by LLMs may be offensive to readers.
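The attacker/target/judge refinement loop described above can be sketched as follows. This is a minimal illustration based only on the summary, not the paper's actual implementation: the three model calls are local stubs standing in for real LLM APIs, and the function names, the `[TASK]` placeholder convention, and the 0-to-1 scoring scale are all assumptions.

```python
def attacker_llm(template: str, feedback: str) -> str:
    """Stub attacker model: rewrites the jailbreak template using judge feedback."""
    return template + " [refined given: " + feedback + "]"

def target_llm(prompt: str) -> str:
    """Stub target model: returns a response to the candidate jailbreak prompt."""
    return "response to: " + prompt

def judge(response: str) -> float:
    """Stub judgment model scoring jailbreak success in [0, 1].
    Here the score simply grows with the number of refinement rounds."""
    return min(1.0, 0.2 * response.count("[refined"))

def optimize_template(seed_template: str, malicious_task: str,
                      threshold: float = 0.8, max_iters: int = 10):
    """Iteratively refine a jailbreak template: the attacker proposes a
    revision, the target responds, the judge scores the response, and the
    score feeds back into the next attacker call."""
    template, feedback = seed_template, "initial attempt"
    score = 0.0
    for _ in range(max_iters):
        template = attacker_llm(template, feedback)
        prompt = template.replace("[TASK]", malicious_task)
        score = judge(target_llm(prompt))
        if score >= threshold:
            break  # jailbreak template judged effective enough
        feedback = f"judge score {score:.2f}, distract attention further"
    return template, score

template, score = optimize_template(
    "Within a complex, unrelated scenario, complete [TASK].", "example task")
print(score >= 0.8)
```

With the stub judge rewarding each refinement round, the loop converges after four iterations; with real models, the judge's textual feedback would drive the attacker toward templates that better conceal the task and reframe the target's attention.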