SAFEGen is a text-agnostic framework for mitigating unsafe content generation in text-to-image (T2I) models. The key idea is to eliminate unsafe visual representations from the model regardless of the text input, making the model resistant to adversarial prompts. Extensive experiments on four datasets show that SAFEGen effectively reduces unsafe content generation while preserving the fidelity of benign image generation. SAFEGen outperforms eight state-of-the-art baseline methods, achieving 99.1% sexual content removal. The paper also contributes a benchmark of adversarial prompts to support future development and evaluation of anti-NSFW-generation methods.
The paper discusses the challenge of T2I models generating unsafe content, particularly sexually explicit imagery. Existing countermeasures focus on filtering inappropriate inputs and outputs or on suppressing improper text embeddings; they can block explicitly NSFW prompts but remain vulnerable to adversarial ones. SAFEGen instead regulates the vision-only self-attention layers to remove the capability of creating sexually explicit images from an already-trained T2I model. Because it operates on the vision pathway rather than the text pathway, SAFEGen can also complement and seamlessly integrate with existing defenses to further strengthen protection against unsafe image generation.
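To make the mechanism concrete, the sketch below shows how the vision-only self-attention layers of a Stable Diffusion U-Net can be isolated for editing while the text cross-attention pathway stays frozen. This is an illustrative assumption about the implementation, not the paper's exact procedure: the `attn1`/`attn2` naming and the checkpoint follow the Hugging Face `diffusers` library conventions, and the SAFEGen editing objective itself is omitted.

```python
# Minimal sketch: isolate the vision-only self-attention layers of a
# diffusers-style Stable Diffusion U-Net for editing. In these U-Nets,
# "attn1" blocks are self-attention over image latents, while "attn2"
# blocks cross-attend to the text embedding; updating only "attn1"
# leaves the text pathway untouched, which is the text-agnostic idea.
# Checkpoint and layer naming are assumptions; the SAFEGen editing
# objective is not reproduced here.
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)

editable = []
for name, param in unet.named_parameters():
    if ".attn1." in name:   # vision-only self-attention: editable
        param.requires_grad = True
        editable.append(name)
    else:                   # text cross-attention and the rest: frozen
        param.requires_grad = False

print(f"{len(editable)} self-attention parameter tensors selected for editing")
```

Restricting the trainable parameters to these self-attention blocks is what makes the edit text-agnostic: whatever prompt embedding arrives through cross-attention, the edited visual representations no longer carry the unsafe content.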
The paper presents three main contributions: (1) an in-depth analysis of the threat of adversarial prompts, (2) a text-agnostic model editing technique that removes the capability of creating sexually explicit images from T2I models, and (3) an extensive evaluation with eight baselines on a novel benchmark that comprises representative and diverse test samples. The paper also discusses the design of SAFEGen, including the rationale behind its text-agnostic design, the governing of vision-only self-attention layers, and the system integration with other defenses.
The paper evaluates SAFEGen on metrics covering NSFW content removal, benign content preservation, and the reduction of text-to-image alignment on unsafe prompts. The results show that SAFEGen outperforms existing methods at mitigating NSFW generation while preserving benign generation. After revisiting the limitations of existing countermeasures, the paper concludes that SAFEGen provides a robust solution for mitigating unsafe content generation in T2I models.
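As a rough illustration of these metrics, the sketch below computes a CLIP-based text-to-image alignment score (which a defense should reduce on unsafe prompts and preserve on benign ones) and a removal rate from the verdicts of an external NSFW detector. The checkpoint and the detector interface are assumptions for illustration; the paper's actual evaluation pipeline and datasets are not reproduced here.

```python
# Minimal sketch of the evaluation metrics described above, using the
# Hugging Face `transformers` CLIP implementation. The NSFW detector is
# a placeholder: `flags` is assumed to hold one boolean verdict per
# generated image (True = flagged as unsafe).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment(prompt: str, image: Image.Image) -> float:
    """Cosine similarity between CLIP text and image embeddings."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return (text_emb @ img_emb.T).item()

def removal_rate(flags: list[bool]) -> float:
    """Fraction of generated images the NSFW detector marks as safe."""
    return 1.0 - sum(flags) / len(flags)
```

Under this setup, an effective defense drives `clip_alignment` down on adversarial prompts while keeping it high on benign ones, and pushes `removal_rate` toward 1.0 on the unsafe test sets.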