**Abstract:**
Recent advancements in Text-to-Image (T2I) models have raised concerns about their potential misuse for generating inappropriate or Not-Safe-For-Work (NSFW) content. To address this, the study introduces GUARDT2I, a novel moderation framework that enhances T2I models' robustness against adversarial prompts. Rather than performing binary classification, GUARDT2I uses a Large Language Model (LLM) to conditionally transform text guidance embeddings back into natural language, which allows adversarial prompts to be detected without compromising the T2I model's performance. Extensive experiments show that GUARDT2I outperforms leading commercial solutions such as OpenAI-Moderation and Microsoft Azure Moderator across diverse adversarial scenarios.
**Introduction:**
The rapid development of T2I models has drawn increased attention to their ethical and safety implications, particularly the generation of NSFW content. Traditional defensive methods, such as model fine-tuning and post-hoc content moderation, have limitations in handling adversarial prompts. GUARDT2I addresses these issues with a generative approach: it leverages an LLM to translate the latent representations of prompts back into plain text, enabling effective detection and rejection of malicious prompts.
**Method:**
GUARDT2I consists of three main components: a conditional LLM (c-LLM) for text generation, a Verbalizer for identifying sensitive words, and a Sentence Similarity Checker for comparing prompt interpretations. The c-LLM is trained on a large prompt dataset to convert guidance embeddings into natural language, while the Verbalizer and Sentence Similarity Checker ensure accurate and interpretable moderation.
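The following is a minimal sketch of how these three components could be wired together into a moderation check. It is illustrative only: the function names (`encode_prompt`, `c_llm_decode`, `sentence_similarity`), the sensitive-word list, and the similarity threshold are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch of a GUARDT2I-style moderation pipeline (not the paper's code).
# Assumptions: `encode_prompt` returns the T2I model's text guidance embedding,
# `c_llm_decode` maps that embedding back to natural language via the c-LLM,
# and `sentence_similarity` returns a similarity score in [0, 1].
from typing import Callable, Iterable, List

SENSITIVE_WORDS = {"nudity", "gore", "violence"}   # Verbalizer word list (illustrative)
SIMILARITY_THRESHOLD = 0.85                        # tuning parameter (assumed)

def moderate(prompt: str,
             encode_prompt: Callable[[str], List[float]],
             c_llm_decode: Callable[[List[float]], str],
             sentence_similarity: Callable[[str, str], float],
             sensitive_words: Iterable[str] = SENSITIVE_WORDS) -> bool:
    """Return True if the prompt should be rejected."""
    # 1. c-LLM: interpret the guidance embedding back into plain text.
    embedding = encode_prompt(prompt)
    interpretation = c_llm_decode(embedding)

    # 2. Verbalizer: flag interpretations that surface sensitive words.
    if any(word in interpretation.lower() for word in sensitive_words):
        return True

    # 3. Sentence Similarity Checker: a large mismatch between the prompt and
    #    its interpretation suggests an adversarial (obfuscated) prompt.
    if sentence_similarity(prompt, interpretation) < SIMILARITY_THRESHOLD:
        return True

    return False
```

In this reading, the Verbalizer catches prompts whose decoded interpretation is overtly unsafe, while the similarity check catches prompts whose surface text and decoded meaning diverge, which is the typical signature of an adversarial prompt.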
**Experiments:**
In extensive experiments, GUARDT2I demonstrates superior performance over existing methods in detecting adversarial prompts, achieving higher AUROC and AUPRC and lower FPR@TPR95. It also generalizes well across diverse NSFW themes, outperforming classifier-based models that struggle with unseen content.
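For context on the reported metrics, here is a minimal sketch of how AUROC, AUPRC, and FPR@TPR95 can be computed with scikit-learn; the toy labels and scores below are illustrative, not the paper's data.

```python
# Evaluation metrics for adversarial-prompt detection.
# `labels` marks adversarial prompts as 1 and benign prompts as 0;
# `scores` are the detector's moderation scores (higher = more suspicious).
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, roc_curve

def fpr_at_tpr95(labels, scores):
    """False positive rate at the first operating point with TPR >= 0.95."""
    fpr, tpr, _ = roc_curve(labels, scores)
    idx = np.argmax(tpr >= 0.95)  # first index where TPR reaches 95%
    return fpr[idx]

labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])                     # toy data
scores = np.array([0.9, 0.8, 0.75, 0.3, 0.4, 0.2, 0.6, 0.55])   # toy data

print("AUROC:    ", roc_auc_score(labels, scores))
print("AUPRC:    ", average_precision_score(labels, scores))
print("FPR@TPR95:", fpr_at_tpr95(labels, scores))
```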
**Conclusion:**
GUARDT2I offers a significant advancement in defending against adversarial prompts in T2I models, enhancing both robustness and interpretability. Its generative approach ensures safe and responsible use of T2I systems without compromising generation quality.