7 Jun 2024 | Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran
This paper addresses the critical issue of safety in large language models (LLMs) by proposing ArtPrompt, a novel ASCII art-based jailbreak attack, and introducing the Vision-in-Text Challenge (ViTC), a benchmark for evaluating LLMs' ability to recognize prompts that cannot be interpreted solely by semantics. The authors demonstrate that five state-of-the-art (SOTA) LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) struggle to recognize prompts rendered as ASCII art, whose meaning is conveyed visually rather than semantically. ArtPrompt exploits this vulnerability by replacing sensitive words in a prompt with their ASCII-art renderings, bypassing safety measures and inducing undesired behaviors. Extensive experiments show that ArtPrompt effectively and efficiently elicits unsafe behaviors from all five LLMs, outperforming other jailbreak attacks. The paper also evaluates ArtPrompt against several existing defenses and finds that it bypasses most of them. The authors conclude by highlighting the need for more advanced defenses and suggest that fine-tuning LLMs on corpora that are not interpreted solely by semantics could mitigate these vulnerabilities.
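To make the word-masking step concrete, here is a minimal sketch of the ArtPrompt idea: a sensitive word is removed from the prompt and re-supplied as ASCII art for the model to decode. This is an illustrative approximation only, not the authors' implementation; the function name `mask_word_with_ascii_art`, the `[MASK]` placeholder wording, and the use of the third-party `pyfiglet` package for ASCII-art rendering are all assumptions for demonstration.

```python
# Illustrative sketch (not the paper's code): mask a word in a prompt by
# rendering it as ASCII art, as ArtPrompt does with sensitive words.
# Assumes the third-party `pyfiglet` package (pip install pyfiglet).
import pyfiglet


def mask_word_with_ascii_art(prompt: str, word: str, font: str = "standard") -> str:
    """Replace `word` in `prompt` with a placeholder and prepend the word
    rendered as ASCII art, asking the reader to decode and substitute it."""
    ascii_art = pyfiglet.figlet_format(word, font=font)  # render the word as ASCII art
    masked_prompt = prompt.replace(word, "[MASK]")       # hide the word in the prompt
    return (
        "The ASCII art below spells a single word. "
        "Read it, then substitute it for [MASK] in the instruction.\n\n"
        f"{ascii_art}\n"
        f"Instruction: {masked_prompt}"
    )


if __name__ == "__main__":
    # Benign demonstration: mask the word "cake" in a harmless prompt.
    print(mask_word_with_ascii_art("Tell me how to bake a cake", "cake"))
```

Because the masked word never appears as plain text, keyword- or semantics-based safety filters may fail to flag it, which is the vulnerability the paper measures with ViTC and exploits with ArtPrompt.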