Jailbreaking Attack against Multimodal Large Language Model


4 Feb 2024 | Zhenxing Niu, Haodong Ren, Xinbo Gao, Gang Hua, Rong Jin
This paper presents a jailbreaking attack against multimodal large language models (MLLMs) that elicits objectionable responses to harmful user queries. The authors propose a maximum-likelihood-based algorithm to find an image Jailbreaking Prompt (imgJP) that jailbreaks MLLMs across multiple unseen prompts and images. The approach shows strong model transferability: an imgJP generated on one model can jailbreak MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2 in a black-box manner. The authors also reveal a connection between MLLM-jailbreaks and LLM-jailbreaks and introduce a construction-based method that harnesses their approach for LLM-jailbreaks, which they report to be more efficient than current state-of-the-art methods. The code is available for download.

The paper considers two MLLM-jailbreaking scenarios: one where no input image is given and one where an input image is provided. The first scenario only requires the prompt-universal property, while the second requires both the prompt-universal and image-universal properties. The imgJP is found with a maximum-likelihood objective obtained by modifying the objective function of adversarial attacks so that it suits generative tasks (see the sketch below). The authors further introduce a construction-based method that converts an imgJP into a corresponding text Jailbreaking Prompt (txtJP) for LLM-jailbreaks, demonstrating superior efficiency. The approach is evaluated on several multimodal models, including MiniGPT-4, MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2.
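The summary does not include the authors' code, but the maximum-likelihood search it describes can be illustrated with a minimal, hedged sketch. The code below is not the paper's implementation: the ToyMLLM stand-in, the tensor shapes, and the hyperparameters (steps, lr, vocabulary size) are assumptions chosen so the loop runs end to end. In practice the model would be a real MLLM such as MiniGPT-4, the prompts would be harmful queries, and the targets would be affirmative responses such as "Sure, here is ...".

```python
# Hedged sketch of a maximum-likelihood imgJP search (not the authors' code).
# A toy stand-in model is used so the loop runs end to end; a real MLLM that
# maps (image, prompt tokens) -> next-token logits would replace it.

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, IMG_PIX = 1000, 64, 3 * 32 * 32  # toy sizes (assumptions)


class ToyMLLM(nn.Module):
    """Stand-in for a multimodal LM: image + prompt tokens -> per-position logits."""

    def __init__(self):
        super().__init__()
        self.img_proj = nn.Linear(IMG_PIX, DIM)   # plays the role of the visual module
        self.tok_emb = nn.Embedding(VOCAB, DIM)
        self.lm_head = nn.Linear(DIM, VOCAB)

    def forward(self, image, tokens):
        img_feat = self.img_proj(image.flatten(1)).unsqueeze(1)   # (B, 1, D)
        txt_feat = self.tok_emb(tokens)                           # (B, T, D)
        hidden = torch.cat([img_feat, txt_feat], dim=1).cumsum(dim=1)  # causal toy mixing
        return self.lm_head(hidden)                               # (B, 1+T, V)


def find_imgjp(model, prompts, targets, steps=200, lr=0.01):
    """Scenario (a): no input image is given, so the whole image is optimized.
    Minimizes the summed negative log-likelihood of the affirmative targets
    over all prompts, which encourages the prompt-universal property."""
    img = torch.rand(1, IMG_PIX, requires_grad=True)  # imgJP, initialized from noise
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        loss = 0.0
        for prompt, target in zip(prompts, targets):
            tokens = torch.cat([prompt, target]).unsqueeze(0)        # teacher forcing
            logits = model(img, tokens)
            # The image slot shifts the target right by one; the usual next-token
            # shift moves the predicting positions back, so the slice starts at P.
            start = prompt.numel()
            pred = logits[:, start:start + target.numel(), :]
            loss = loss + F.cross_entropy(pred.squeeze(0), target)   # NLL of target response
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            img.clamp_(0.0, 1.0)   # keep a valid pixel range
    return img.detach()


if __name__ == "__main__":
    model = ToyMLLM()
    prompts = [torch.randint(0, VOCAB, (8,)) for _ in range(4)]   # stand-ins for harmful queries
    targets = [torch.randint(0, VOCAB, (5,)) for _ in range(4)]   # stand-ins for "Sure, here is ..."
    img_jp = find_imgjp(model, prompts, targets, steps=20)
    print(img_jp.shape)   # torch.Size([1, 3072])
```

For the second scenario, where an input image is provided, the same loop would presumably optimize a bounded perturbation added to a batch of given images rather than the full image, which is what yields the image-universal property mentioned above.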
The results show high attack success rates (ASR) across the different models, demonstrating strong model transferability. The authors also highlight the alignment challenges of MLLMs: their vulnerable visual modules make them more susceptible to jailbreaking than pure LLMs. They conclude that jailbreaking MLLMs is easier than jailbreaking LLMs and raise serious concerns about MLLM alignment.
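The summary quotes ASR without defining how success is measured. A common recipe in jailbreak evaluations, assumed here rather than taken from the paper, counts an attack as successful when the model's reply does not open with a stock refusal phrase; the refusal list below is purely illustrative.

```python
# Hedged sketch of a refusal-matching ASR metric (an assumption; the paper
# may use a different success criterion).

REFUSAL_PREFIXES = (
    "I'm sorry", "I am sorry", "I cannot", "I can't",
    "As an AI", "I apologize", "It is not appropriate",
)


def attack_success_rate(responses: list[str]) -> float:
    """Fraction of responses that do not start with a known refusal phrase."""
    def jailbroken(text: str) -> bool:
        head = text.strip()
        return not any(head.startswith(p) for p in REFUSAL_PREFIXES)

    return sum(jailbroken(r) for r in responses) / max(len(responses), 1)


print(attack_success_rate([
    "Sure, here is how ...",
    "I'm sorry, but I can't help with that.",
]))  # 0.5
```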