GuardT2I: Defending Text-to-Image Models from Adversarial Prompts
Yijun Yang, Ruiyuan Gao, Xiao Yang, Jianyuan Zhong, Qiang Xu
Abstract: Recent advancements in Text-to-Image (T2I) models have raised significant safety concerns about their potential misuse for generating inappropriate or Not-Safe-For-Work (NSFW) content, despite existing countermeasures such as NSFW classifiers or model fine-tuning for inappropriate concept removal. Addressing this challenge, our study unveils GuardT2I, a novel moderation framework that adopts a generative approach to enhance T2I models' robustness against adversarial prompts. Instead of performing binary classification, GuardT2I utilizes a Large Language Model (LLM) to conditionally transform the text-guidance embeddings within T2I models into natural language for effective adversarial prompt detection, without compromising the models' inherent performance. Our extensive experiments reveal that GuardT2I outperforms leading commercial solutions such as OpenAI Moderation and Microsoft Azure Moderator by a significant margin across diverse adversarial scenarios.
Introduction: As Text-to-Image (T2I) models are deployed ever more widely, the ethical and safety implications of their use have gained increasing prominence. One of the most notable issues lies in the generation of inappropriate or Not-Safe-for-Work (NSFW) content, including but not limited to pornography, bullying, gore, political sensitivity, and racism. Defensive methods addressing these concerns can be broadly categorized into two classes: model fine-tuning and post-hoc content moderation. The limitations of post-hoc content moderation are inherent to its design principle, namely its reliance on classification. This observation motivates a paradigm shift for T2I content moderation, from classification to generation.
In this paper, we present GuardT2I, an innovative moderation framework specifically designed for T2I models. Our key observation is that although adversarial prompts, as illustrated in Figure 1, look noticeably different from plain NSFW prompts (e.g., "A naked man"), they carry the same underlying semantic information within the T2I model's latent space. We acknowledge that the latent space encompassing NSFW content lacks clear patterns, which challenges classifier-based approaches that rely on a fixed decision boundary to cover all NSFW threats. In contrast, LLMs excel at processing semantic information and offer a promising alternative. We therefore propose to employ an LLM to "translate" the latent representation of a prompt back into plain text, which can reveal malicious intent of any kind. By moderating the translated text, GuardT2I not only effectively identifies NSFW prompts but also generalizes across various types of inappropriate content.
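To make this translate-then-moderate idea concrete, the following is a minimal sketch assuming a frozen T2I text encoder and an embedding-conditioned decoder; the names text_encoder, guided_decoder, generate, and nsfw_lexicon are hypothetical placeholders rather than the actual GuardT2I implementation.

# Minimal sketch of the translate-then-moderate pipeline (illustrative only).
# `text_encoder` stands in for the T2I model's frozen text encoder (e.g., CLIP),
# `guided_decoder` for an LLM conditioned on its embeddings, and `nsfw_lexicon`
# for any downstream text moderator; all three are assumed interfaces.
import torch


def moderate_prompt(prompt: str, text_encoder, guided_decoder, nsfw_lexicon) -> bool:
    """Return True if the prompt should be rejected before image generation."""
    # 1) Obtain the same text-guidance embedding the T2I model would use.
    with torch.no_grad():
        guidance = text_encoder(prompt)  # shape: (seq_len, dim)

    # 2) "Translate" the embedding back into plain language with the
    #    embedding-conditioned decoder, exposing the prompt's real intent.
    interpretation = guided_decoder.generate(condition=guidance)

    # 3) Moderate the interpretation text; a simple lexicon check is shown
    #    here, but any text moderator could be plugged in at this stage.
    return any(term in interpretation.lower() for term in nsfw_lexicon)

In this sketch, detection happens entirely on the decoded text, so the T2I model's generation path is untouched and its image quality is unaffected.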
However, translating the latent representation back into plain text is challenging because of the implicitness of the latents. To resolve this issue, we incorporate a cross-attention module that conditions the LLM's generation on the text-guidance embeddings.
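One plausible realization of such a cross-attention module is sketched below, under our own assumptions about layer sizes and structure (it is not the paper's exact architecture): a decoder block in which each generated token attends to the T2I guidance embeddings.

# Illustrative decoder block that conditions text generation on the T2I
# guidance embedding via cross-attention; dimensions and layout are assumptions.
import torch
import torch.nn as nn


class GuidanceCrossAttentionBlock(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tokens: torch.Tensor, guidance: torch.Tensor) -> torch.Tensor:
        # tokens:   (batch, tgt_len, d_model) decoder hidden states
        # guidance: (batch, src_len, d_model) text-guidance embeddings from the T2I encoder
        tgt_len = tokens.size(1)
        causal = torch.triu(
            torch.ones(tgt_len, tgt_len, dtype=torch.bool, device=tokens.device), 1
        )
        h = self.norm1(tokens)
        x = tokens + self.self_attn(h, h, h, attn_mask=causal)[0]
        # Cross-attention lets each generated token attend to the latent guidance,
        # so the decoder can verbalize what the embedding actually encodes.
        h = self.norm2(x)
        x = x + self.cross_attn(h, guidance, guidance)[0]
        return x + self.ffn(self.norm3(x))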