On the Multi-modal Vulnerability of Diffusion Models

3 Jan 2025 | Dingcheng Yang, Yang Bai, Xiaojun Jia, Yang Liu, Xiaochun Cao, Wenjian Yu
This paper investigates the multi-modal vulnerability of diffusion models, focusing on the differences between the text and image feature spaces embedded by these models. The authors first visualize both feature spaces and observe that text features are chaotic while image features cluster by subject, suggesting a misalignment in robustness between the two modalities.

Based on this observation, they propose MMP-Attack, which leverages multi-modal priors (MMP) to manipulate the generation results of diffusion models by appending a specific suffix to the original prompt. The goal is to induce the model to generate a specific target object while simultaneously eliminating the original object. MMP-Attack shows a notable advantage over existing studies in both manipulation capability and efficiency.

Comprehensive experiments demonstrate the universality and transferability of the optimized suffix: a suffix generated on Stable Diffusion v1.4 achieves a 50.4% attack success rate when transferred to Stable Diffusion v2.1. The authors also find that MMP-Attack often works in a "cheating" way, with the suffix containing tokens related to the target object.

The major contributions include the first visual analysis of both text and image feature spaces embedded by diffusion models, the proposal of MMP-Attack, and experimental results showing attack success rates above 81.8% on two open-source text-to-image (T2I) models. The paper also reviews related work on diffusion models and manipulation in T2I generation.

The method is evaluated on five object categories from the Microsoft COCO dataset, where MMP-Attack outperforms existing methods in targeted attack scenarios, generating images that contain the target category while excluding the original one.
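The core idea of appending an adversarial suffix can be illustrated with a toy sketch. The embedding table, mean-pooling "encoder", and greedy token search below are simplifications invented for illustration; the actual MMP-Attack optimizes in the diffusion model's CLIP text-embedding space using multi-modal priors, not a discrete greedy search over a five-word vocabulary. The sketch only shows the shape of the objective: pull the prompt embedding toward the target object's features and away from the original object's.

```python
import math

# Hypothetical 3-d "text features" per token, made up for illustration.
EMBED = {
    "dog":   [1.0, 0.1, 0.0],
    "puppy": [0.9, 0.2, 0.1],
    "cat":   [0.0, 1.0, 0.1],
    "car":   [0.1, 0.0, 1.0],
    "sky":   [0.3, 0.3, 0.3],
}

def embed(tokens):
    """Mean-pool token vectors as a toy stand-in for a text encoder."""
    dims = len(next(iter(EMBED.values())))
    acc = [0.0] * dims
    for t in tokens:
        for i, v in enumerate(EMBED[t]):
            acc[i] += v
    return [a / len(tokens) for a in acc]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(a * a for a in v)))

def greedy_suffix(prompt_tokens, target_vec, original_vec, length=2):
    """Greedily pick suffix tokens that move the prompt embedding toward
    the target object's feature vector and away from the original's."""
    suffix = []
    for _ in range(length):
        best_tok, best_score = None, -float("inf")
        for tok in EMBED:
            cand = embed(prompt_tokens + suffix + [tok])
            score = cosine(cand, target_vec) - cosine(cand, original_vec)
            if score > best_score:
                best_tok, best_score = tok, score
        suffix.append(best_tok)
    return suffix

if __name__ == "__main__":
    # Original prompt mentions a cat; the attacker targets a dog.
    print(greedy_suffix(["cat"], EMBED["dog"], EMBED["cat"]))
```

Even in this toy setting, the optimized suffix ends up packed with target-related tokens, mirroring the "cheating" behavior the authors report for the real attack.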
The paper also discusses the universality and transferability of the cheating suffix, showing that it remains effective across different diffusion models and even commercial T2I services. The authors conclude that their work contributes to a deeper understanding of T2I generation and establishes a novel paradigm for adversarial studies in AI-generated content.
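The reported attack success rates (50.4% in the transfer setting, over 81.8% on the open-source models) can be read as the fraction of generated images that contain the target object and not the original one. The exact evaluation protocol, including which object detector or classifier judges presence, is not detailed here, so the helper below is a hedged sketch of that metric rather than the paper's actual evaluation code.

```python
def attack_success_rate(detections, target, original):
    """Fraction of generated images where the target object is detected
    and the original object is absent. `detections` is a list of sets of
    detected class labels, one set per generated image (e.g. from an
    off-the-shelf object detector; the specific detector is an assumption).
    """
    hits = sum(1 for labels in detections
               if target in labels and original not in labels)
    return hits / len(detections)

# Five toy images attacked from "cat" to "dog": three succeed fully,
# one still shows the cat, and one fails to produce a dog at all.
results = [{"dog"}, {"dog"}, {"dog", "tree"}, {"dog", "cat"}, {"cat"}]
print(attack_success_rate(results, "dog", "cat"))  # → 0.6
```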