Pruning for Robust Concept Erasing in Diffusion Models

26 May 2024 | Tianyun Yang, Juan Cao, and Chang Xu
This paper proposes a pruning-based strategy for robust concept erasing in diffusion models. While diffusion models are powerful image generators, they can produce undesirable outputs such as NSFW content and copyrighted artworks. Existing concept-erasing methods, which fine-tune model parameters to remove unwanted concepts, are often not robust against adversarial prompts: carefully crafted prompts can regenerate the erased concepts. To address this, the authors identify concept-correlated neurons that are sensitive to adversarial prompts and propose a pruning strategy that deactivates these neurons, reducing their sensitivity. By selectively pruning critical parameters associated with the concepts to be removed, the method strengthens the model's robustness against adversarial inputs.

The approach integrates with existing concept-erasing techniques and is flexible enough to apply to various erasing objectives. Experiments across three test environments (erasing nudity, artistic styles, and objects) show significant robustness gains, including a 40% improvement in erasing NSFW content and a 30% improvement in erasing artwork styles. The authors also demonstrate that pruning preserves the model's ability to generate other standard concepts without compromising generation quality, and that the reduced sensitivity of the pruned model accounts for the observed robustness improvement. The paper additionally reviews related work on concept erasing in diffusion models and neural network pruning, presents a detailed analysis of the proposed method, and reports that it outperforms existing approaches in both robustness and concept-erasing performance.
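As a rough illustration of the pruning idea (a minimal sketch, not the authors' actual implementation: the sensitivity score, the activation-gap proxy, and the pruning ratio below are all assumptions), one could score each neuron by how strongly its activation shifts between concept-bearing and neutral prompts, then deactivate the highest-scoring neurons by zeroing their outgoing weights:

```python
import numpy as np

def sensitivity_scores(acts_concept, acts_neutral):
    """Score each neuron by its mean absolute activation gap between
    concept-bearing and neutral prompts (an illustrative proxy for
    concept correlation, not the paper's exact metric)."""
    return np.abs(acts_concept.mean(axis=0) - acts_neutral.mean(axis=0))

def prune_neurons(weights, scores, prune_ratio=0.05):
    """Zero the outgoing weights of the top-scoring neurons, which
    deactivates them (hard pruning; the paper combines this with
    existing concept-erasing objectives)."""
    k = max(1, int(len(scores) * prune_ratio))
    top = np.argsort(scores)[-k:]   # indices of most concept-correlated neurons
    pruned = weights.copy()
    pruned[top, :] = 0.0            # deactivate those neurons
    return pruned, top

# Toy example: 4 prompts x 8 neurons; neuron 3 fires only for the concept.
rng = np.random.default_rng(0)
acts_c = rng.normal(0.0, 0.1, (4, 8))
acts_c[:, 3] += 5.0                 # simulated concept-correlated activation
acts_n = rng.normal(0.0, 0.1, (4, 8))
w = np.ones((8, 16))                # hypothetical layer weights

scores = sensitivity_scores(acts_c, acts_n)
w_pruned, pruned_idx = prune_neurons(w, scores, prune_ratio=0.125)
```

In this toy setup, the one neuron whose activation tracks the concept is the one selected and zeroed, while all other neurons keep their weights, mirroring the paper's claim that pruning can be targeted enough to leave generation of other concepts intact.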
The authors conclude that their method provides a robust solution for concept erasing in diffusion models, addressing the risk of generating inappropriate content in real-world applications.