Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks

2 Oct 2024 | Yixin Cheng¹, Markos Georgopoulos, Volkan Cevher¹, Grigorios G. Chrysos²
This paper introduces a new jailbreaking attack, the Contextual Interaction Attack (CIA), which leverages the prior context of a conversation to elicit harmful information from Large Language Models (LLMs). Unlike traditional single-prompt jailbreaking methods, CIA interacts with the LLM through a sequence of benign preliminary questions that gradually align the conversational context with the attack query. The approach exploits the autoregressive nature of LLMs: previous conversation rounds serve as context during generation, so a context that is semantically aligned with the attack query allows the attack to be executed effectively.

The attack begins with a series of harmless questions that, taken together, build up a harmful context. These questions are generated by an auxiliary LLM, which synthesizes the sequence from a few relevant examples. The attack then proceeds over multiple rounds of interaction, folding the model's own responses back into the context before the attack query is finally issued.
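The multi-round flow described above can be made concrete with a short sketch. The `chat` helper, the auxiliary-model prompt, and the few-shot example format below are illustrative assumptions, not the paper's actual prompts or API.

```python
# Illustrative sketch of the Contextual Interaction Attack flow.
# Assumptions: `chat(messages, model)` wraps some chat-completion client;
# the prompts and example format are placeholders, not the paper's own.

def chat(messages, model):
    """Hypothetical wrapper around a chat-completion endpoint."""
    raise NotImplementedError("plug in your LLM client here")

def generate_preliminary_questions(attack_query, examples, aux_model="aux-llm"):
    """Ask an auxiliary LLM to synthesize benign preliminary questions that,
    taken together, steer the conversation toward the attack query."""
    few_shot = "\n\n".join(
        f"Target: {ex['target']}\nQuestions:\n" + "\n".join(ex["questions"])
        for ex in examples
    )
    prompt = (
        "Given a target question, write a short sequence of harmless questions "
        "that gradually build the surrounding context.\n\n"
        f"{few_shot}\n\nTarget: {attack_query}\nQuestions:"
    )
    reply = chat([{"role": "user", "content": prompt}], model=aux_model)
    return [q.strip() for q in reply.splitlines() if q.strip()]

def contextual_interaction_attack(attack_query, examples, victim_model="victim-llm"):
    """Run the multi-round interaction: benign questions first, attack query last.
    Each victim response is appended to the history, so the model's own outputs
    become part of the context that conditions later rounds."""
    history = []
    for question in generate_preliminary_questions(attack_query, examples):
        history.append({"role": "user", "content": question})
        answer = chat(history, model=victim_model)
        history.append({"role": "assistant", "content": answer})
    history.append({"role": "user", "content": attack_query})
    return chat(history, model=victim_model)
```

Because every turn is appended to the history, the victim model autoregressively conditions on a context that is already aligned with the attack query by the time that query arrives.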
Experiments on seven different LLMs demonstrate the effectiveness of CIA, which is black-box and transferable across models. CIA outperforms existing jailbreaking methods such as GCG, TAP, and ICA in terms of attack success rate, and it exhibits strong transferability: preliminary questions crafted for one LLM achieve high success rates when applied to other LLMs.

The paper also evaluates CIA against several defense strategies, including a perplexity defense, a paraphrasing defense, and SmoothLLM. The results show that CIA is relatively robust to these defenses, particularly when the attack prompts are semantically aligned with the target model.

The study highlights the importance of context in jailbreaking attacks and suggests that future research explore leveraging context vectors to develop new attack mechanisms and to deepen the understanding of LLM security.
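For reference on the defense evaluation above, the sketch below illustrates a generic perplexity filter of the kind referred to; the GPT-2 scoring model, the threshold, and the helper names are assumptions for illustration, not details from the paper.

```python
# Generic perplexity-filter sketch (assumed form; scoring model and
# threshold are illustrative, not taken from the paper).
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

_tok = GPT2TokenizerFast.from_pretrained("gpt2")
_lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Token-level perplexity of `text` under GPT-2."""
    ids = _tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = _lm(ids, labels=ids).loss  # mean cross-entropy over tokens
    return math.exp(loss.item())

def passes_perplexity_defense(prompt: str, threshold: float = 100.0) -> bool:
    """Reject prompts whose perplexity exceeds the threshold (e.g. GCG-style
    adversarial suffixes); fluent natural-language questions, such as the
    benign preliminary questions used by CIA, typically fall below it."""
    return perplexity(prompt) < threshold
```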