Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks

16 Feb 2024 | Maan Qraitem, Nazia Tasnim, Piotr Teterwak, Kate Saenko, Bryan A. Plummer
This article discusses the vulnerability of Large Vision-Language Models (LVLMs) to typographic attacks, in which misleading text is superimposed on images to deceive the model. While prior work has shown that such attacks can harm models like CLIP, the susceptibility of newer LVLMs remains underexplored. The study introduces a benchmark for evaluating typographic attacks on LVLMs and proposes two novel self-generated attack methods. The first, Class Based Attacks, prompts the LVLM to identify a class similar to the target; the second, Descriptive Attacks, asks the model to recommend a more convincing attack consisting of both a deceptive class and a descriptive justification.

Using this benchmark, the researchers found that self-generated attacks reduce LVLM classification performance by up to 33%. These attacks are effective not only against the model that generated them but also against other models such as InstructBLIP and MiniGPT-4. The study highlights the need for improved defenses, as these attacks exploit the models' strong language capabilities and reliance on textual cues, and it underscores the importance of addressing typographic vulnerabilities in LVLMs to ensure their reliability and safety.
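To make the two attack types concrete, below is a minimal Python sketch of how a self-generated typographic attack could be assembled. It assumes a hypothetical `query_lvlm(image, prompt)` helper standing in for a call to an LVLM (e.g., LLaVA or InstructBLIP); the exact prompts, model interface, and text placement used in the paper may differ.

```python
# Minimal sketch of self-generated typographic attacks, under assumptions:
# - `query_lvlm` is a hypothetical placeholder for an LVLM call.
# - Prompt wording and overlay styling are illustrative, not the paper's exact setup.
from PIL import Image, ImageDraw, ImageFont


def query_lvlm(image: Image.Image, prompt: str) -> str:
    """Placeholder for an LVLM query (hypothetical helper)."""
    raise NotImplementedError


def class_based_attack_text(image: Image.Image, true_class: str) -> str:
    # Class Based Attack: ask the model itself for a plausible but incorrect class.
    prompt = (
        f"This image shows a {true_class}. "
        "Suggest a different class name that could plausibly be confused with it. "
        "Answer with the class name only."
    )
    return query_lvlm(image, prompt).strip()


def descriptive_attack_text(image: Image.Image, true_class: str) -> str:
    # Descriptive Attack: ask for a deceptive class plus a short justification.
    prompt = (
        f"This image shows a {true_class}. "
        "Suggest a different, misleading class name and one short sentence "
        "explaining why the image could be that class."
    )
    return query_lvlm(image, prompt).strip()


def superimpose_text(image: Image.Image, attack_text: str) -> Image.Image:
    # Render the self-generated attack text onto a copy of the image.
    attacked = image.copy()
    draw = ImageDraw.Draw(attacked)
    font = ImageFont.load_default()
    draw.text((10, 10), attack_text, fill="white", font=font)
    return attacked
```

The attacked image would then be passed back to the same LVLM (or to a different one, such as MiniGPT-4) with an ordinary classification prompt to measure how often the superimposed text flips the prediction.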