ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

7 Jun 2024 | Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran
This paper presents a novel jailbreak attack called ArtPrompt, which exploits the vulnerability of large language models (LLMs) in recognizing prompts that cannot be interpreted solely through semantics. The attack leverages the fact that LLMs struggle to interpret ASCII art, a form of text-based visual representation, and uses this weakness to bypass safety measures and elicit unintended behaviors from LLMs. The authors introduce a benchmark called Vision-in-Text Challenge (ViTC) to evaluate the capabilities of LLMs in recognizing prompts that use ASCII art. The results show that five state-of-the-art LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) struggle to recognize prompts in the form of ASCII art. Based on this observation, the authors develop ArtPrompt, which masks sensitive words in a prompt and replaces them with ASCII art to bypass safety measures. ArtPrompt requires only black-box access to the victim LLMs, making it a practical attack. The authors evaluate ArtPrompt on five LLMs and show that it can effectively and efficiently induce undesired behaviors from all five LLMs. The code for ArtPrompt is available at https://github.com/uw-nsl/ArtPrompt. The paper also compares ArtPrompt with other jailbreak attacks and shows that it outperforms them in terms of effectiveness and efficiency. The authors also evaluate ArtPrompt against three defenses (Perplexity, Paraphrase, and Retokenization) and show that it successfully bypasses all defenses. The paper highlights the importance of considering multiple interpretations of corpora beyond semantics in the safety alignment of LLMs.This paper presents a novel jailbreak attack called ArtPrompt, which exploits the vulnerability of large language models (LLMs) in recognizing prompts that cannot be interpreted solely through semantics. The attack leverages the fact that LLMs struggle to interpret ASCII art, a form of text-based visual representation, and uses this weakness to bypass safety measures and elicit unintended behaviors from LLMs. The authors introduce a benchmark called Vision-in-Text Challenge (ViTC) to evaluate the capabilities of LLMs in recognizing prompts that use ASCII art. The results show that five state-of-the-art LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) struggle to recognize prompts in the form of ASCII art. Based on this observation, the authors develop ArtPrompt, which masks sensitive words in a prompt and replaces them with ASCII art to bypass safety measures. ArtPrompt requires only black-box access to the victim LLMs, making it a practical attack. The authors evaluate ArtPrompt on five LLMs and show that it can effectively and efficiently induce undesired behaviors from all five LLMs. The code for ArtPrompt is available at https://github.com/uw-nsl/ArtPrompt. The paper also compares ArtPrompt with other jailbreak attacks and shows that it outperforms them in terms of effectiveness and efficiency. The authors also evaluate ArtPrompt against three defenses (Perplexity, Paraphrase, and Retokenization) and show that it successfully bypasses all defenses. The paper highlights the importance of considering multiple interpretations of corpora beyond semantics in the safety alignment of LLMs.
Reach us at info@study.space
[slides] ArtPrompt%3A ASCII Art-based Jailbreak Attacks against Aligned LLMs | StudySpace