The paper "T2IShield: Defending Against Backdoors on Text-to-Image Diffusion Models" addresses the vulnerability of text-to-image (T2I) diffusion models to backdoor attacks, which manipulate model outputs through malicious triggers. The authors propose a comprehensive defense method named T2IShield, which includes three main components: detection, localization, and mitigation.
1. **Detection**: T2IShield identifies backdoor samples by detecting the "Assimilation Phenomenon" in cross-attention maps, where the backdoor trigger assimilates the attention of the other tokens. Two detection methods are proposed: Frobenius Norm Threshold Truncation (FTT) and Covariance Discriminant Analysis (CDA). FTT uses the Frobenius norm to measure the structural correlation of the attention maps, while CDA leverages covariance to capture finer-grained structural correlations and applies Linear Discriminant Analysis (LDA) for classification.
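The FTT idea can be illustrated with a minimal sketch: under the Assimilation Phenomenon, every token's cross-attention map collapses toward the same structure, so the average Frobenius norm of each map's deviation from the mean map becomes unusually small. The threshold value and the exact statistic below are illustrative assumptions, not the paper's tuned settings.

```python
import numpy as np

def ftt_detect(attn_maps, threshold=2.5):
    """Sketch of Frobenius Norm Threshold Truncation (FTT).

    attn_maps: array of shape (num_tokens, H, W), the per-token
        cross-attention maps (e.g. averaged over diffusion steps and heads).
    threshold: illustrative cutoff; in practice it would be calibrated
        on held-out benign prompts.
    Returns True if the prompt is flagged as a backdoor sample.
    """
    mean_map = attn_maps.mean(axis=0)
    # Assimilation: a trigger makes all token maps nearly identical,
    # so the mean residual norm drops toward zero.
    residuals = [np.linalg.norm(m - mean_map) for m in attn_maps]
    return float(np.mean(residuals)) < threshold
```

Benign prompts yield diverse attention maps (large residuals), so they fall above the cutoff, while assimilated maps fall below it.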
2. **Localization**: T2IShield introduces a binary-search-based method to precisely locate the backdoor trigger within a detected backdoor sample. The method relies on the assumption that, when the prompt is split in half, the half containing the trigger still generates the attacker's target content, which allows the trigger to be narrowed down by repeated splitting.
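The splitting procedure can be sketched as a standard binary search. The `exhibits_backdoor` oracle below is a hypothetical caller-supplied function (e.g. generating an image from the token sublist and checking it against the backdoor target, or reusing the detector); a single-token trigger is assumed for simplicity.

```python
def locate_trigger(tokens, exhibits_backdoor):
    """Binary-search trigger localization sketch.

    tokens: list of prompt tokens from a flagged backdoor sample.
    exhibits_backdoor: hypothetical oracle taking a token sublist and
        returning True if generating from it still produces the
        attacker's target content.
    Returns the index of the suspected trigger token.
    """
    lo, hi = 0, len(tokens)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Whichever half still triggers the target must contain the trigger.
        if exhibits_backdoor(tokens[lo:mid]):
            hi = mid
        else:
            lo = mid
    return lo
```

Each iteration halves the candidate span, so localization costs O(log n) generations rather than one per token.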
3. **Mitigation**: T2IShield mitigates the poisoned impact of a localized trigger by leveraging concept editing methods. Two state-of-the-art concept editing methods, Refact and UCE, are evaluated; Refact outperforms UCE, achieving a lower attack success rate and higher average similarity to benign outputs.
Experiments on two advanced backdoor attack scenarios demonstrate the effectiveness of T2IShield, achieving a detection F1 score of 88.9% and a localization F1 score of 86.4%, with 99% of poisoned samples invalidated. The method is lightweight and efficient, making it suitable for practical deployment.