17 Jul 2024 | Zhongqi Wang, Jie Zhang, Shiguang Shan, and Xilin Chen
T2IShield is a comprehensive defense method proposed to detect, localize, and mitigate backdoor attacks on text-to-image diffusion models. The method identifies the "Assimilation Phenomenon," where backdoor triggers assimilate the attention maps of other tokens, leading to consistent structural attention responses in backdoor samples. Based on this insight, two effective backdoor detection methods are proposed: Frobenius Norm Threshold Truncation (FTT) and Covariance Discriminant Analysis (CDA). FTT calculates the Frobenius norm of attention maps to classify backdoor samples, while CDA leverages covariance to represent the fine-grained structural correlation of attention maps and applies Linear Discriminant Analysis (LDA) for classification. A binary-search-based method is introduced to localize the trigger within a backdoor sample, and existing concept editing methods are assessed for their effectiveness in mitigating backdoor attacks. Empirical evaluations on two advanced backdoor attack scenarios show that T2IShield achieves a detection F1 score of 88.9% with low computational cost, a localization F1 score of 86.4%, and invalidates 99% of poisoned samples. The method is effective in detecting and mitigating backdoor attacks on text-to-image diffusion models.
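To make the FTT and binary-search ideas concrete, the following is a minimal sketch rather than the authors' implementation. It rests on several assumptions not stated in the summary: that the FTT statistic is the mean Frobenius distance of each token's attention map from the average map, that a low value indicates assimilation (and therefore a backdoor), that the trigger is a single token, and that a detector can be re-run on a sub-prompt. The names `ftt_score`, `ftt_is_backdoor`, `locate_trigger`, and the `is_backdoor` callable are hypothetical, and the threshold would have to be tuned on real data.

```python
import numpy as np


def ftt_score(attn_maps: np.ndarray) -> float:
    """Mean Frobenius distance of each token's map from the average map.

    attn_maps has shape (num_tokens, height, width). Assumption: a low
    score means the maps look alike, as assimilation would produce.
    """
    mean_map = attn_maps.mean(axis=0, keepdims=True)
    deviations = attn_maps - mean_map
    # Frobenius norm over the two spatial axes, one value per token.
    per_token = np.linalg.norm(deviations, ord="fro", axis=(1, 2))
    return float(per_token.mean())


def ftt_is_backdoor(attn_maps: np.ndarray, threshold: float) -> bool:
    """Flag a sample whose score falls below a tuned (hypothetical) threshold."""
    return ftt_score(attn_maps) < threshold


def locate_trigger(tokens, is_backdoor):
    """Binary search for a single trigger token.

    `is_backdoor` is a hypothetical callable that runs a detector on a
    list of tokens. Returns the index of the suspected trigger, or None
    if the full prompt is not flagged.
    """
    if not is_backdoor(tokens):
        return None
    lo, hi = 0, len(tokens)  # the trigger is assumed to lie in tokens[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_backdoor(tokens[lo:mid]):
            hi = mid  # left half is still flagged, so narrow to it
        else:
            lo = mid  # otherwise assume the trigger is in the right half
    return lo
```

With a detector that costs one model query per call, the search needs roughly log2(n) queries for an n-token prompt instead of n. Multi-token triggers or triggers that only fire in context would break the single-token assumption made here.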