30 May 2024 | Mingli Zhu, Siyuan Liang, Baoyuan Wu
This paper addresses the persistent challenge of defending against backdoor attacks in deep neural networks. Although existing defenses have shown promising performance in reducing attack success rates, the authors question whether they truly eliminate the backdoor threat. They introduce a novel metric, the *backdoor existence coefficient* (BEC), to measure the extent to which a backdoor remains present in a defended model. Surprisingly, they find that the original backdoor still exists in defended models, even though it can no longer be activated by the original trigger. To verify this, they propose a backdoor re-activation attack that augments the original trigger with a carefully crafted perturbation obtained via universal adversarial attacks. The attack is effective in both the white-box setting and the black-box setting, where the adversary can only query the model during inference. The proposed methods are evaluated on image classification and multimodal contrastive learning tasks, where the re-activation attack achieves substantially higher attack success rates against defended models. The study highlights a critical vulnerability in existing defense strategies and underscores the need for more robust and advanced backdoor defense mechanisms.
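The abstract describes the white-box re-activation attack only at a high level: a universal adversarial perturbation is optimized on top of the original trigger so that the defended model once again predicts the backdoor target. The sketch below is a minimal, hypothetical rendering of that idea in PyTorch, not the paper's actual implementation; `stamp_trigger`, `target_label`, the L-infinity budget `eps`, and the optimizer settings are all assumptions.

```python
import torch
import torch.nn.functional as F

def reactivation_perturbation(model, loader, stamp_trigger, target_label,
                              eps=8 / 255, steps=50, lr=1e-2, device="cuda"):
    """Optimize a single universal perturbation `delta` that, added on top of the
    original trigger, re-activates the dormant backdoor in a defended model.

    `stamp_trigger(x)` applies the attacker's original trigger to a batch of images
    and `target_label` is the backdoor target class; both are placeholders here.
    """
    model.eval()
    # One perturbation shared by all inputs (universal adversarial perturbation);
    # its shape is taken from a sample batch.
    x0, _ = next(iter(loader))
    delta = torch.zeros_like(x0[:1]).to(device).requires_grad_(True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        for x, _ in loader:
            x = x.to(device)
            # Stamp the original trigger, then add the learnable perturbation.
            x_adv = torch.clamp(stamp_trigger(x) + delta, 0.0, 1.0)
            logits = model(x_adv)
            # Push every perturbed input toward the original backdoor target class.
            y_t = torch.full((x.size(0),), target_label, device=device)
            loss = F.cross_entropy(logits, y_t)
            opt.zero_grad()
            loss.backward()
            opt.step()
            # Keep the perturbation within an L-infinity budget.
            with torch.no_grad():
                delta.clamp_(-eps, eps)
    return delta.detach()
```

This sketch assumes white-box access, since it backpropagates through the defended model; the black-box variant mentioned in the abstract would instead have to estimate the update direction from query feedback alone, which is not shown here.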