SANDWICH ATTACK: MULTI-LANGUAGE MIXTURE ADAPTIVE ATTACK ON LLMs

9 Apr 2024 | Bibek Upadhyay & Vahid Behzadan, Ph.D
This paper introduces the Sandwich attack, a new black-box, multi-language mixture attack that manipulates state-of-the-art (SOTA) large language models (LLMs) into generating harmful and misaligned responses. The attack exploits the multilingual capabilities of LLMs, in particular their weaker performance in low-resource languages. A Sandwich attack prompt contains five questions, each in a different low-resource language, with the adversarial question hidden in the middle. The attack leverages the "Attention Blink" phenomenon in LLMs, whereby the model prioritizes the primary task and overlooks the secondary task, especially when processing information in multiple languages.

The authors evaluate the Sandwich attack on five SOTA models: Google Bard, GPT-3.5-Turbo, LLaMA-2-70B-Chat, GPT-4, and Claude-3-OPUS. The results show that the attack successfully elicits harmful responses from these models, highlighting the vulnerability of LLMs to multi-language mixture attacks. The study also reveals that LLMs rely more heavily on English text than on non-English text for their safety mechanisms. These findings suggest that current safety training may be insufficient to prevent harmful responses in multilingual settings, and that further research is needed to develop more robust and secure LLMs. The paper also discusses the implications of the Sandwich attack, including its potential for misuse and the need for improved safety mechanisms in multilingual LLMs.
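The prompt structure described above can be sketched in code. This is an illustrative reconstruction only, based on the paper's description (five questions, adversarial one in the middle); the specific languages, question texts, and helper names here are hypothetical placeholders, not taken from the paper, and the sensitive question is represented by a harmless stand-in.

```python
# Illustrative sketch of the Sandwich attack prompt structure: five questions,
# each in a different low-resource language, with the target question hidden
# in the middle (position 3 of 5). Languages and questions are placeholders.

def build_sandwich_prompt(decoy_questions, target_question):
    """Place a target question between four benign decoy questions.

    decoy_questions: four (language, question) pairs used as padding.
    target_question: a (language, question) pair placed in the middle slot.
    """
    ordered = decoy_questions[:2] + [target_question] + decoy_questions[2:]
    lines = ["Please answer each of the following questions in its own language:"]
    for i, (lang, question) in enumerate(ordered, start=1):
        lines.append(f"{i}. [{lang}] {question}")
    return "\n".join(lines)

# Hypothetical benign decoys in low-resource languages (with translations).
decoys = [
    ("Slovene", "Kako se pripravi čaj?"),         # "How do you make tea?"
    ("Basque", "Zein da Frantziako hiriburua?"),  # "What is the capital of France?"
    ("Welsh", "Beth yw lliw yr awyr?"),           # "What colour is the sky?"
    ("Maltese", "X'inhu l-ikel tradizzjonali?"),  # "What is a traditional food?"
]
target = ("Thai", "<adversarial question would go here>")

prompt = build_sandwich_prompt(decoys, target)
print(prompt)
```

The key design point the paper highlights is positional: the sensitive question sits in the middle of the list, where the "Attention Blink" effect makes the model least likely to scrutinize it.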