14 Jun 2024 | Yimeng Zhang, Xin Chen, Jinghan Jia, Yihua Zhang, Chongyu Fan, Jiancheng Liu, Mingyi Hong, Ke Ding, Sijia Liu
This paper addresses the safety concerns of diffusion models (DMs) by enhancing their robustness against adversarial prompt attacks through a novel framework called *AdvUnlearn*. While DMs have achieved remarkable success in text-to-image generation, they can produce harmful content when prompted with inappropriate text. Existing techniques such as machine unlearning (concept erasing) remain vulnerable to adversarial prompt attacks, which can coax an unlearned model into regenerating the supposedly erased images. To overcome this, the authors integrate adversarial training (AT) into the unlearning process, formulating it as a bi-level optimization (BLO) scheme: the lower level crafts adversarial prompts that attempt to resurrect the erased concept, while the upper level updates the model so the concept stays erased even under those worst-case prompts. To balance the trade-off between erasure robustness and model utility, they introduce a utility-retaining regularization based on a retained prompt set. The text encoder is identified as a more suitable module for robustification than the UNet, and the learned text encoder can be shared across different DM types. Extensive experiments show that AdvUnlearn significantly improves robustness against adversarial prompt attacks while preserving image generation quality, outperforming existing methods across unlearning scenarios including the erasure of nudity, objects, and artistic styles. The approach is presented as the first systematic exploration of AT in DM unlearning, setting it apart from existing methods that overlook robustness in concept erasing.
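To make the bi-level structure concrete, below is a minimal PyTorch-style sketch of an AT-based unlearning loop. Everything here is an illustrative assumption rather than the authors' implementation: `unlearn_loss` stands in for a diffusion-based erasing objective (e.g., ESD-style), the attack is a simple PGD-style perturbation in prompt-embedding space, and the hyperparameters `gamma`, `eps`, and `alpha` are placeholders; the real method trains a Stable Diffusion text encoder.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of an AdvUnlearn-style bi-level loop (hypothetical names,
# not the authors' code). Embedding layers stand in for the text encoder.
torch.manual_seed(0)

text_encoder = torch.nn.Embedding(1000, 64)    # trainable module being robustified
frozen_encoder = torch.nn.Embedding(1000, 64)  # frozen copy of the original encoder
frozen_encoder.load_state_dict(text_encoder.state_dict())
frozen_encoder.requires_grad_(False)

def unlearn_loss(prompt_embeds):
    # Placeholder for the concept-erasing loss (e.g., an ESD-style diffusion
    # loss). Low values mean the target concept is successfully suppressed.
    return prompt_embeds.pow(2).mean()

def retain_loss(enc, ref, prompt_ids):
    # Utility-retaining regularization: keep embeddings of benign retained
    # prompts close to those of the original (frozen) encoder.
    return F.mse_loss(enc(prompt_ids), ref(prompt_ids))

opt = torch.optim.Adam(text_encoder.parameters(), lr=1e-4)
erase_ids = torch.randint(0, 1000, (8, 16))   # prompts containing the target concept
retain_ids = torch.randint(0, 1000, (8, 16))  # retained prompt set
gamma, eps, alpha = 0.5, 0.1, 0.02            # assumed hyperparameters

for step in range(100):
    # Lower level (attack): find a bounded perturbation of the concept-prompt
    # embeddings that maximizes the erasing loss, i.e., tries to make the
    # erased concept resurface.
    base = text_encoder(erase_ids).detach()
    delta = torch.zeros_like(base, requires_grad=True)
    for _ in range(5):                        # a few PGD-style ascent steps
        grad, = torch.autograd.grad(unlearn_loss(base + delta), delta)
        with torch.no_grad():
            delta += alpha * grad.sign()
            delta.clamp_(-eps, eps)

    # Upper level (defense): update the text encoder so the concept stays
    # erased even under the adversarial prompt, while the retain term
    # preserves utility on benign prompts.
    opt.zero_grad()
    loss = unlearn_loss(text_encoder(erase_ids) + delta.detach()) \
           + gamma * retain_loss(text_encoder, frozen_encoder, retain_ids)
    loss.backward()
    opt.step()
```

The min-max pattern mirrors standard adversarial training: the inner loop plays the adversarial prompt attacker, and the outer step minimizes the erasing loss at the attacker's solution plus the utility-retaining term, which is what allows the trade-off between robustness and generation quality to be tuned via the regularization weight.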