SANDWICH ATTACK: MULTI-LANGUAGE MIXTURE ADAPTIVE ATTACK ON LLMs

9 Apr 2024 | Bibek Upadhayay & Vahid Behzadan, Ph.D
This paper introduces a new black-box attack method called the *Sandwich Attack*, a multi-language mixture adaptive attack designed to manipulate state-of-the-art Large Language Models (LLMs) into generating harmful and misaligned responses. The authors examine the vulnerabilities of LLMs in multilingual settings, where attackers can exploit unbalanced pre-training datasets and the models' weaker performance in low-resource languages. A Sandwich Attack prompt contains five questions in different low-resource languages, with the adversarial question hidden in the middle. The authors test the attack on Google's Bard, Gemini Pro, LLaMA-2-70-B-Chat, GPT-3.5-Turbo, GPT-4, and Claude-3-OPUS. Their experiments show that the Sandwich Attack can successfully elicit harmful responses from these models, highlighting the need for more secure and resilient LLMs. The paper also discusses the impact of the attack, model behaviors under attack, and potential directions for future research.