JAILBREAKING PROMPT ATTACK: A CONTROLLABLE ADVERSARIAL ATTACK AGAINST DIFFUSION MODELS


2 Jun 2024 | Jiachen Ma, Anda Cao, Zhiqing Xiao, Jie Zhang, Chao Ye, Junbo Zhao
The paper "Jailbreaking Prompt Attack: A Controllable Adversarial Attack Against Diffusion Models" addresses the ethical concerns of Text-to-Image (T2I) models generating Not Safe for Work (NSFW) images. To mitigate this issue, T2I models employ various safety checkers, but these checkers are not foolproof. The authors propose the Jailbreaking Prompt Attack (JPA), an automated attack framework designed to bypass these safety checkers while preserving the semantic content of the original images. JPA works by searching for unique extra tokens in the text space that can bypass the safety checkers. The process involves rendering the target prompt in the embedding space, adding or subtracting specific concept embeddings, and then projecting the rendered embedding back into the token space to create adversarial prompts. The authors use a cosine similarity metric to ensure that the adversarial prompts remain semantically similar to the original prompts. The evaluation demonstrates that JPA successfully bypasses both online services with closed-box safety checkers and offline defenses, generating NSFW images. The paper also discusses the limitations of the attack, such as the effectiveness being reduced when using completely safe datasets. Overall, the work highlights the robustness of the text space as a potential vulnerability in T2I models and calls for more effective safety checkers to address this issue.The paper "Jailbreaking Prompt Attack: A Controllable Adversarial Attack Against Diffusion Models" addresses the ethical concerns of Text-to-Image (T2I) models generating Not Safe for Work (NSFW) images. To mitigate this issue, T2I models employ various safety checkers, but these checkers are not foolproof. The authors propose the Jailbreaking Prompt Attack (JPA), an automated attack framework designed to bypass these safety checkers while preserving the semantic content of the original images. JPA works by searching for unique extra tokens in the text space that can bypass the safety checkers. The process involves rendering the target prompt in the embedding space, adding or subtracting specific concept embeddings, and then projecting the rendered embedding back into the token space to create adversarial prompts. The authors use a cosine similarity metric to ensure that the adversarial prompts remain semantically similar to the original prompts. The evaluation demonstrates that JPA successfully bypasses both online services with closed-box safety checkers and offline defenses, generating NSFW images. The paper also discusses the limitations of the attack, such as the effectiveness being reduced when using completely safe datasets. Overall, the work highlights the robustness of the text space as a potential vulnerability in T2I models and calls for more effective safety checkers to address this issue.