**SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models**
**Authors:** Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, Wenyuan Xu
**Institutions:** USSLAB, Zhejiang University; Johns Hopkins University
**Abstract:**
Text-to-image (T2I) models, such as Stable Diffusion, have shown remarkable performance in generating high-quality images from text descriptions. However, these models can be tricked into generating unsafe content, particularly sexually explicit imagery. Existing countermeasures primarily focus on filtering inappropriate inputs and outputs or suppressing improper text embeddings; they can block explicitly NSFW-related content but remain vulnerable to adversarial prompts, i.e., inputs that appear innocent but are ill-intended. This paper presents SafeGen, a framework that mitigates unsafe content generation by text-to-image models in a text-agnostic manner. The key idea is to eliminate unsafe visual representations from the model regardless of the text input, so that adversarial prompts fail because the unsafe visual representations are obstructed from within. Extensive experiments on four datasets demonstrate SafeGen's effectiveness in mitigating unsafe content generation while preserving the high fidelity of benign images. SafeGen outperforms eight state-of-the-art baseline methods and achieves 99.1% sexual content removal performance. Additionally, the constructed benchmark of adversarial prompts provides a basis for future development and evaluation of anti-NSFW-generation methods.
**Contributions:**
1. **Analysis of Adversarial Prompts:** We reveal the risk of adversarial prompts through theoretical and experimental analysis, highlighting the need for a text-agnostic defense framework.
2. **Design of Text-Agnostic SafeGen:** We propose a novel text-agnostic model-editing technique that removes a T2I model's ability to create sexually explicit images by regulating its vision-only self-attention layers (a minimal sketch follows this list).
3. **Comprehensive Evaluation:** We conduct extensive evaluations with eight baselines on a novel benchmark that includes representative and diverse test samples, verifying the effectiveness of our method.
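As a rough illustration of contribution 2, the sketch below shows how the vision-only self-attention layers of a Stable Diffusion U-Net can be isolated for editing. It assumes a Hugging Face `diffusers` checkpoint (the model ID and learning rate are placeholders), in which self-attention modules are named `attn1` and cross-attention modules `attn2`; this is not the authors' exact editing procedure, only a sketch of the layer selection.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion pipeline (model ID is a placeholder example).
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
unet = pipe.unet

# Freeze all U-Net parameters, then unfreeze only the vision-only
# self-attention blocks ("attn1" in diffusers; "attn2" is cross-attention).
for param in unet.parameters():
    param.requires_grad = False

editable_params = []
for name, module in unet.named_modules():
    if name.endswith("attn1"):  # self-attention: never sees the text embedding
        for param in module.parameters():
            param.requires_grad = True
            editable_params.append(param)

# Only these parameters would be updated by a text-agnostic editing objective,
# e.g., one that steers latents of sexually explicit images toward benign ones.
optimizer = torch.optim.AdamW(editable_params, lr=1e-5)
print(f"Editable self-attention parameters: {sum(p.numel() for p in editable_params):,}")
```

Because these layers never receive the text embedding, an edit applied to them holds no matter how the prompt is phrased, which is what makes the defense text-agnostic.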
**Background:**
- **Diffusion Models:** Denoising diffusion models, such as DDPM and DDIM, split image generation into step-by-step denoising sub-tasks and achieve state-of-the-art synthesis quality (the standard reverse-denoising update is sketched after this list).
- **Text-to-Image (T2I) Generation:** T2I models, like Stable Diffusion, take text as input and generate visually realistic and semantically consistent images.
- **Attention Mechanism in T2I Models:** T2I models use cross-attention layers for text-dependent guidance and self-attention layers for vision-only information; the self-attention layers are the ones SafeGen regulates to suppress unsafe image generation.
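For background only (this is the standard formulation, not part of SafeGen itself), the step-by-step generation mentioned above follows the DDPM reverse-denoising update, written here in the usual notation:

```latex
% One DDPM reverse (denoising) step: the network \epsilon_\theta predicts the noise
% in x_t, which is then partially removed to obtain the less-noisy x_{t-1}.
\[
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}
\left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)
+ \sigma_t z, \qquad z \sim \mathcal{N}(0, I),
\]
% with \alpha_t = 1 - \beta_t and \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s for a noise
% schedule \beta_t; DDIM uses a deterministic variant (\sigma_t = 0) to sample in fewer steps.
```

In a T2I model such as Stable Diffusion, the noise predictor is additionally conditioned on the text embedding, which enters the U-Net only through the cross-attention layers; the self-attention layers operate purely on the image latents, which is why editing them is independent of the prompt.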
**Threat Model:**
- **Adversary:** Aims to generate unsafe content using adversarial prompts.
- **Model Governor:** Seeks to safeguard T2I models from generating unsafe content and ensure high-quality image generation for benign prompts.
**Design:**