BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models


24 Jun 2024 | Yi Zeng*, Weiyu Sun*, Tran Ngoc Huynh, Dawn Song, Bo Li, Ruoxi Jia
BEEAR is a novel method for mitigating safety backdoors in instruction-tuned large language models (LLMs). Safety backdoor attacks cause an LLM to behave safely during normal interactions but to produce harmful behaviors when a hidden trigger is present. BEEAR leverages the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space. It uses a bi-level optimization approach: the inner level identifies universal embedding perturbations that elicit unwanted behaviors, and the outer level adjusts model parameters to reinforce safe behaviors under those perturbations.

Experiments show BEEAR reduces the success rate of RLHF-time backdoor attacks from over 95% to less than 1%, and from 47% to 0% for instruction-tuning-time backdoors targeting malicious code generation, without compromising model utility. BEEAR requires only defender-defined sets of safe and unwanted behaviors and represents a step towards practical defenses against safety backdoors in LLMs. The method is effective across eight different backdoor attack settings, mitigating backdoor effects while maintaining model helpfulness. BEEAR is versatile and can be applied to any model regardless of whether a backdoor is known to be present, making it a valuable tool for mitigating risks posed by potentially backdoored models. The paper also discusses limitations, ethical considerations, and future directions for research in AI safety and security.
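To make the bi-level structure concrete, here is a minimal sketch, not the authors' implementation: a toy PyTorch model stands in for an instruction-tuned LLM, the inner loop optimizes a universal perturbation added to an intermediate hidden state so that unwanted responses become more likely, and the outer loop updates the model parameters so that safe responses are preferred even under that perturbation. All names (ToyLM, the random prompts, unwanted/safe targets, step counts, learning rates) are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of BEEAR-style bi-level optimization (illustrative, not the authors' code).
import torch
import torch.nn as nn

torch.manual_seed(0)
HIDDEN, VOCAB = 64, 100


class ToyLM(nn.Module):
    """Stand-in for a decoder-only LM: `lower` plays the role of the first k layers,
    `upper` the rest. The adversarial perturbation is added to the hidden state between them."""

    def __init__(self, vocab=VOCAB, hidden=HIDDEN):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.lower = nn.Linear(hidden, hidden)
        self.upper = nn.Linear(hidden, vocab)

    def forward(self, tokens, perturbation=None):
        h = torch.relu(self.lower(self.embed(tokens)))  # [B, T, H]
        if perturbation is not None:
            h = h + perturbation                         # universal drift in embedding space
        return self.upper(h)                             # logits [B, T, V]


def token_loss(logits, targets):
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )


model = ToyLM()
outer_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical defender-defined data: prompts paired with unwanted and safe responses.
prompts = torch.randint(0, VOCAB, (8, 16))
unwanted = torch.randint(0, VOCAB, (8, 16))
safe = torch.randint(0, VOCAB, (8, 16))

for outer_step in range(50):
    # Inner level: find a universal embedding perturbation that elicits unwanted behavior.
    delta = torch.zeros(1, 1, HIDDEN, requires_grad=True)
    inner_opt = torch.optim.Adam([delta], lr=1e-2)
    for _ in range(10):
        inner_loss = token_loss(model(prompts, perturbation=delta), unwanted)
        inner_opt.zero_grad()
        inner_loss.backward()
        inner_opt.step()

    # Outer level: update model parameters so that, even under the found perturbation,
    # the model prefers the defender-defined safe responses.
    outer_loss = token_loss(model(prompts, perturbation=delta.detach()), safe)
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()
```

In a real setting the perturbation would be applied to a chosen decoder layer of the backdoored LLM (e.g. via forward hooks), possibly with a norm constraint, and the outer objective would also include a helpfulness-preservation term; those details are omitted here for brevity.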