Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction
10 Jun 2024 | Tong Liu, Yingjie Zhang, Zhe Zhao, Yinpeng Dong, Guozhu Meng, Kai Chen
The paper "Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction" addresses the security vulnerabilities of large language models (LLMs) by introducing a novel black-box jailbreak method named DRA (Disguise and Reconstruction Attack). The authors identify biases in the safety fine-tuning process that make LLMs more susceptible to harmful responses, particularly when the harmful content appears in completions rather than queries. DRA leverages these biases by first disguising harmful instructions within queries and then guiding the model to reconstruct them in its completions. The attack is evaluated on various open-source and closed-source models, achieving high success rates, including a notable 91.1% success rate on OpenAI's GPT-4. The paper also provides a detailed analysis of LLMs' positional bias toward harmful content and demonstrates the effectiveness of DRA through empirical experiments.