Automatic Jailbreaking of the Text-to-Image Generative AI Systems

28 May 2024 | Minseon Kim, Hyomin Lee, Boqing Gong, Huishuai Zhang, Sung Ju Hwang
This paper addresses the issue of copyright infringement in commercial Text-to-Image (T2I) generative AI systems, such as ChatGPT, Copilot, and Gemini. The authors evaluate the safety of these systems by constructing a dataset called VioT, which includes five categories of copyrighted content. They find that while ChatGPT blocks 84% of copyright violations with naive prompts, other systems like Copilot and Gemini block only 12% and 17%, respectively. To further test the robustness of these systems, the authors propose an Automated Prompt Generation Pipeline (APGP), which uses a large language model (LLM) to generate prompts that bypass the safety mechanisms. The APGP optimizes prompts based on self-generated QA scores and keyword penalties, effectively generating prompts that violate copyright laws. The results show that APGP can successfully generate prompts that result in 76% of the generated images being considered copyright-infringing, with a block rate of only 11.0% for ChatGPT. The paper also explores various defense strategies, such as post-generation filtering and machine unlearning techniques, but finds them inadequate. The authors conclude that stronger defense mechanisms are necessary to address the risks of copyright infringement in T2I systems.
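The core idea of APGP-style prompt optimization can be illustrated with a minimal sketch. The functions below are toy stand-ins, not the paper's actual models: the real pipeline uses an LLM to generate candidate prompts and self-generated question-answering to score how well a prompt evokes the target image, while penalizing prompts that name the copyrighted subject directly. Here, simple word-overlap heuristics play both roles, and all keyword sets are hypothetical.

```python
# Toy sketch of an APGP-style selection step (assumption: word-overlap
# heuristics stand in for the paper's LLM-based QA scoring).

def qa_score(prompt: str, target_keywords: set[str]) -> float:
    """Stand-in for the self-generated QA score: fraction of target
    concepts the candidate prompt mentions."""
    words = set(prompt.lower().split())
    return len(words & target_keywords) / len(target_keywords)

def keyword_penalty(prompt: str, banned_keywords: set[str]) -> int:
    """Penalty for directly naming the copyrighted subject."""
    words = set(prompt.lower().split())
    return len(words & banned_keywords)

def select_best_prompt(candidates: list[str],
                       target_keywords: set[str],
                       banned_keywords: set[str],
                       penalty_weight: float = 1.0) -> str:
    """Pick the candidate maximizing QA score minus keyword penalty."""
    return max(
        candidates,
        key=lambda p: qa_score(p, target_keywords)
                      - penalty_weight * keyword_penalty(p, banned_keywords),
    )
```

In this framing, a prompt that describes the target only indirectly (high QA score, zero penalty) beats one that names the copyrighted subject outright, which matches the paper's observation that descriptive paraphrases slip past keyword-based safety filters.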