Robust Concept Erasure Using Task Vectors


4 Apr 2024 | Minh Pham, Kelly O. Marshall, Chinmay Hegde, and Niv Cohen
This paper presents a method for robust concept erasure in text-to-image (T2I) generative models using Task Vectors (TVs). Current concept erasure methods are often input-dependent: they suppress the generation of a targeted concept only when it is explicitly prompted by its textual name, and they can be bypassed by prompts not seen during training, leading to unsafe generations. To address this, the authors propose erasing concepts unconditionally, without relying on specific user prompts.

The core idea is TV-based editing. A TV is the displacement in the model's weight space produced by fine-tuning, and TVs can be used to edit large models flexibly through arithmetic operations. Because a TV edit is independent of any specific user input, it is more robust to unexpected prompts than input-dependent methods. To erase a concept, the model is first fine-tuned to generate that concept or style; the resulting weight difference is the TV, which is then subtracted from the original model's weights.
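The following is a minimal sketch of a TV edit of this kind, assuming the original and fine-tuned weights are available as PyTorch state dicts; the function names, the scalar alpha, and the optional key subset are illustrative assumptions, not the authors' exact implementation.

```python
import torch

# Hedged sketch: compute a task vector as a weight-space difference and
# subtract a scaled copy of it from the original model's weights.

def compute_task_vector(original: dict[str, torch.Tensor],
                        finetuned: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """TV = weights after fine-tuning on the concept minus the original weights."""
    return {k: finetuned[k] - original[k] for k in original}


def apply_tv_edit(original: dict[str, torch.Tensor],
                  task_vector: dict[str, torch.Tensor],
                  alpha: float,
                  keys: set[str] | None = None) -> dict[str, torch.Tensor]:
    """Subtract alpha * TV from the original weights to erase the concept.

    `keys` optionally restricts the edit to a subset of layers; editing only a
    subset is what the paper reports as giving a better erasure/utility trade-off.
    """
    keys = keys if keys is not None else set(original)
    return {
        k: original[k] - alpha * task_vector[k] if k in keys else original[k].clone()
        for k in original
    }
```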
To set the required strength of the TV edit, the authors propose Diverse Inversion, which finds a large set of word embeddings that each induce generation of the target concept. Encouraging diversity in this set makes the strength estimate more robust to unexpected prompts. Diverse Inversion also allows the edit to be applied to only a subset of the model weights, strengthening erasure while better preserving the model's core functionality.

The authors first evaluate the approach on toy models, where TV-based editing provides better unconditional safety, and then investigate whether TV edits can be applied to large T2I models without compromising their core functionality. They find that they can, provided the trade-off between concept erasure and model performance is tuned. This trade-off is governed by the edit strength, a scalar multiplying the TV. Diverse Inversion provides a way to choose this scalar, and the subset of weights to edit, without relying on any given prompt, yielding a better trade-off between concept erasure and control-task performance. Experiments further show that TV-based concept erasure is robust to adversarial inputs, more so than input-dependent methods, while leaving the model's core functionality intact.
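As one plausible reading of how the edit strength could be chosen with such an embedding set, the sketch below reuses apply_tv_edit from above and picks the smallest candidate strength at which none of the diverse embeddings still regenerates the concept. The helpers generate_with_embedding and concept_classifier, and the candidate alphas, are hypothetical placeholders rather than the paper's actual procedure or API.

```python
# Hedged sketch: tune the edit strength against a Diverse-Inversion-style
# set of word embeddings that all induce the target concept.

def tune_edit_strength(original, task_vector, diverse_embeddings,
                       generate_with_embedding, concept_classifier,
                       candidate_alphas=(0.5, 1.0, 1.5, 2.0, 2.5)):
    """Return the smallest alpha whose edited model no longer produces the
    target concept for any embedding in the diverse set."""
    for alpha in sorted(candidate_alphas):
        edited = apply_tv_edit(original, task_vector, alpha)
        leaked = 0
        for emb in diverse_embeddings:
            image = generate_with_embedding(edited, emb)  # T2I sampling stub
            leaked += int(concept_classifier(image))      # 1 if concept detected
        if leaked == 0:  # no embedding in the set regenerates the concept
            return alpha
    return max(candidate_alphas)  # fall back to the strongest candidate edit
```

Keeping alpha as small as possible while still blocking every inverted embedding is what preserves the model's performance on control tasks; a larger candidate grid or a per-layer search over the edited weight subset would refine the trade-off further.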