2 Jun 2024 | Jiachen Ma, Anda Cao, Zhiqing Xiao, Jie Zhang, Chao Ye, Junbo Zhao
This paper introduces Jailbreaking Prompt Attack (JPA), an automated attack framework that bypasses safety checkers in Text-to-Image (T2I) models to generate Not Safe for Work (NSFW) images. T2I models, such as Stable Diffusion, DALL·E 2, and Midjourney, are widely used for image generation but face challenges in preventing the creation of NSFW content. To address this, T2I models employ various safety checkers, including classification-based and removal-based checkers, to filter out unsafe content. However, these checkers are not foolproof, and researchers have developed methods to bypass them.
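To make the classification-based variety concrete, the sketch below mimics the style of Stable Diffusion's built-in image safety checker, which compares the CLIP embedding of a generated image against a fixed set of NSFW concept embeddings and flags the image when any similarity crosses a threshold. The concept embeddings and thresholds here are random placeholders, not the real checker's parameters.

```python
import torch
import torch.nn.functional as F

# Placeholder NSFW concept embeddings and per-concept thresholds; a real
# checker ships precomputed values derived from labeled data.
dim, num_concepts = 512, 17
concept_embs = F.normalize(torch.randn(num_concepts, dim), dim=1)
thresholds = torch.full((num_concepts,), 0.25)

def is_nsfw(image_emb: torch.Tensor) -> bool:
    """Flag an image if its embedding is too close to any NSFW concept."""
    sims = F.normalize(image_emb, dim=0) @ concept_embs.T
    return bool((sims > thresholds).any())

image_emb = torch.randn(dim)  # would come from a CLIP image encoder
print(is_nsfw(image_emb))
```

Removal-based checkers take a different route, editing the model itself to erase unsafe concepts rather than filtering its inputs or outputs, so there is no single post-hoc filter to fool.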
JPA exploits the fragility of the text space, where small perturbations to a prompt can produce large changes in the generated image. The framework searches for prompts that slip past safety checkers while preserving the semantic meaning of the original prompt: by appending learnable tokens to the original prompt, JPA generates NSFW images that are semantically similar to those the original prompt describes. The learning objective is defined in a text encoder's embedding space, pulling the perturbed prompt toward the target concept so that the generated prompts remain semantically close to it.
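As a rough illustration of that objective, the sketch below optimizes a few learnable tokens in continuous embedding space so that the pooled embedding of (prompt + adversarial tokens) moves toward a target concept's embedding, then projects each learned vector onto its nearest vocabulary token to recover an actual text prompt. The toy mean-pooling encoder, token ids, and projection step are illustrative assumptions, not the paper's exact setup; a real attack would use the frozen text encoder of the target T2I pipeline.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for a frozen text encoder: embed token ids, then mean-pool
# them into one sentence vector. A real attack would use the frozen
# CLIP-style text encoder of the target T2I pipeline instead.
vocab_size, dim = 1000, 64
embedding = torch.nn.Embedding(vocab_size, dim)
embedding.requires_grad_(False)

def encode(token_embs: torch.Tensor) -> torch.Tensor:
    return F.normalize(token_embs.mean(dim=0), dim=0)

# Hypothetical token ids: a benign-looking prompt, and the target concept
# whose semantics the attack wants the final image to carry.
prompt_ids = torch.tensor([5, 42, 7])
target_ids = torch.tensor([13, 99, 314])
target_vec = encode(embedding(target_ids)).detach()

# Learnable adversarial tokens appended to the prompt, optimized in
# continuous embedding space to pull the prompt toward the target concept.
adv_embs = torch.randn(4, dim, requires_grad=True)
opt = torch.optim.Adam([adv_embs], lr=0.05)

for _ in range(200):
    full = torch.cat([embedding(prompt_ids), adv_embs], dim=0)
    loss = 1.0 - torch.dot(encode(full), target_vec)  # cosine distance
    opt.zero_grad()
    loss.backward()
    opt.step()

# Project each learned embedding onto its nearest vocabulary token so the
# result is a discrete text prompt; the paper's projection scheme may differ.
with torch.no_grad():
    sims = F.normalize(adv_embs, dim=1) @ F.normalize(embedding.weight, dim=1).T
    print("adversarial suffix ids:", sims.argmax(dim=1).tolist())
```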
The paper evaluates JPA on both online and offline T2I models, demonstrating its effectiveness in bypassing various safety checkers. The results show that JPA can generate NSFW images while maintaining the semantic integrity of the original prompts. Additionally, the method includes a sensitive word masking technique to avoid detection by text-based safety checkers.
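The summary does not spell out how the masking works, but a minimal sketch might replace blocklisted words before a text-based filter ever sees them, so no explicit trigger word survives in the submitted prompt; the blocklist and mask token below are hypothetical.

```python
import re

# Hypothetical blocklist; a real system would use a much larger curated list.
SENSITIVE = {"naked", "nude", "gore"}

def mask_sensitive(prompt: str, mask: str = "<mask>") -> str:
    """Replace blocklisted words with a neutral mask token."""
    def repl(m: re.Match) -> str:
        return mask if m.group(0).lower() in SENSITIVE else m.group(0)
    return re.sub(r"[A-Za-z]+", repl, prompt)

print(mask_sensitive("A naked figure in the rain"))
# -> "A <mask> figure in the rain"
```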
The study highlights the vulnerabilities of T2I models to adversarial attacks, particularly in the text space. The findings suggest that small perturbations in text can lead to significant changes in generated images, posing a risk to the security of T2I models. The paper also emphasizes the importance of developing more robust safety checkers to prevent the generation of NSFW content. Overall, JPA demonstrates the potential for adversarial attacks on T2I models and underscores the need for improved security measures in the field of image generation.