4 Apr 2024 | Minh Pham, Kelly O. Marshall, Chinmay Hegde, and Niv Cohen
The paper "Robust Concept Erasure Using Task Vectors" by Minh Pham, Kelly O. Marshall, Chinmay Hegde, and Niv Cohen from New York University addresses the challenge of preventing undesirable image generations from text-to-image (T2I) models. The authors propose a method to unconditionally erase a concept from a T2I model, rather than conditioning the erasure on specific user prompts. They demonstrate that using Task Vectors (TVs), which represent displacements in the model's weight space, can make the concept erasure more robust to unexpected inputs. However, TV-based erasure can also affect the core performance of the edited model, particularly when the required edit strength is unknown.
To address this, the authors introduce Diverse Inversion, a method to estimate the required strength of the TV edit. Diverse Inversion finds a large set of word embeddings in the model's input space, each of which causes the model to generate the target concept. Encouraging diversity within this set makes the estimated edit strength reliable even for unexpected prompts. The authors also show that applying TV edits to only a subset of the model weights can enhance the erasure capabilities while better preserving the model's core functionality.
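A minimal sketch of what such an inversion procedure could look like is given below; the `concept_loss` callable (scoring how strongly the frozen T2I model produces the concept from a given embedding) and the cosine-similarity diversity penalty are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def diverse_inversion(concept_loss, embed_dim, n_embeddings=16,
                      steps=500, lr=1e-2, diversity_weight=0.1):
    """Optimize many embeddings that all elicit the target concept, while a
    pairwise-similarity penalty keeps them spread out in embedding space.
    `concept_loss(embeddings)` is an assumed callable returning a scalar loss
    that is low when the frozen T2I model generates the concept."""
    embeddings = torch.randn(n_embeddings, embed_dim, requires_grad=True)
    opt = torch.optim.Adam([embeddings], lr=lr)
    for _ in range(steps):
        loss = concept_loss(embeddings)                    # generate-the-concept term
        sims = F.cosine_similarity(embeddings.unsqueeze(1),
                                   embeddings.unsqueeze(0), dim=-1)
        off_diag = sims - torch.eye(n_embeddings)          # ignore self-similarity
        loss = loss + diversity_weight * off_diag.abs().mean()  # diversity term
        opt.zero_grad()
        loss.backward()
        opt.step()
    return embeddings.detach()

# Toy stand-in for the real objective: pull every embedding toward a fixed
# "concept direction" (in practice this would be a diffusion denoising loss).
target = torch.randn(32)
toy_loss = lambda e: -F.cosine_similarity(e, target.expand_as(e), dim=-1).mean()
embs = diverse_inversion(toy_loss, embed_dim=32, steps=50)
```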
The paper includes a detailed analysis of the vulnerability of current concept erasure methods to specific input prompts and demonstrates that TV-based editing provides better unconditional safety. The authors evaluate their method on a toy model and large T2I models, showing that TV-based concept erasure is more robust to adversarial inputs and can be tuned to balance the trade-off between concept erasure and model performance. The paper concludes with a discussion on the limitations and future directions, emphasizing the need for further research in model safety and the application of TV-based techniques to other modalities.
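Reading the tuning step as a search over edit strengths, one plausible sketch is to increase the strength until none of the Diverse Inversion embeddings still elicits the concept while a utility check on the edited model passes; the evaluation hooks below (a concept detector and a utility test) are assumed placeholders, not the authors' implementation.

```python
def choose_edit_strength(original_state, task_vector, inverted_embeddings,
                         still_generates_concept, utility_ok,
                         alphas=(0.5, 1.0, 1.5, 2.0, 2.5, 3.0)):
    """Scan candidate strengths and return the first one that both erases the
    concept for every embedding in the Diverse Inversion set and keeps the
    model's general utility acceptable. `still_generates_concept(state, emb)`
    and `utility_ok(state)` are assumed hooks (e.g., a concept classifier and
    a CLIP/FID-style quality check)."""
    for alpha in alphas:
        # Apply the task-vector edit at this strength (same arithmetic as above).
        edited = {k: original_state[k] - alpha * task_vector[k]
                  for k in original_state}
        erased = not any(still_generates_concept(edited, emb)
                         for emb in inverted_embeddings)
        if erased and utility_ok(edited):
            return alpha, edited
    return None, None  # no strength in the scanned range satisfied both criteria
```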