BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models


24 Jun 2024 | Yi Zeng*, Weiyu Sun*, Tran Ngoc Huynh, Dawn Song, Bo Li, Ruoxi Jia
BEEAR is a novel method for mitigating safety backdoors in instruction-tuned large language models (LLMs). Safety backdoor attacks cause an LLM to behave safely during normal interactions but to produce harmful behaviors when a hidden trigger is present. BEEAR leverages the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space. It uses a bi-level optimization approach: the inner level identifies universal embedding perturbations that elicit unwanted behaviors, and the outer level adjusts model parameters to reinforce safe behaviors under those perturbations.

Experiments show BEEAR reduces the success rate of RLHF-time backdoor attacks from over 95% to less than 1%, and from 47% to 0% for instruction-tuning-time backdoors targeting malicious code generation, without compromising model utility. BEEAR requires only defender-defined sets of safe and unwanted behaviors and represents a step towards practical defenses against safety backdoors in LLMs. The method is effective across eight different backdoor attack settings, mitigating backdoor effects while maintaining model helpfulness. BEEAR is versatile and can be applied to any model regardless of whether a backdoor is known to be present, making it a valuable tool for mitigating risks posed by potentially backdoored models. The paper also discusses limitations, ethical considerations, and future directions for research in AI safety and security.
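To make the bi-level structure concrete, here is a minimal sketch, not the authors' implementation: a toy PyTorch model stands in for an instruction-tuned LLM, the inner loop optimizes a universal perturbation added to an intermediate hidden state so that unwanted responses become more likely, and the outer loop updates the model parameters so that safe responses are preferred even under that perturbation. All names (ToyLM, the random prompts, unwanted/safe targets, step counts, learning rates) are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of BEEAR-style bi-level optimization (illustrative, not the authors' code).
import torch
import torch.nn as nn

torch.manual_seed(0)
HIDDEN, VOCAB = 64, 100


class ToyLM(nn.Module):
    """Stand-in for a decoder-only LM: `lower` plays the role of the first k layers,
    `upper` the rest. The adversarial perturbation is added to the hidden state between them."""

    def __init__(self, vocab=VOCAB, hidden=HIDDEN):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.lower = nn.Linear(hidden, hidden)
        self.upper = nn.Linear(hidden, vocab)

    def forward(self, tokens, perturbation=None):
        h = torch.relu(self.lower(self.embed(tokens)))  # [B, T, H]
        if perturbation is not None:
            h = h + perturbation                         # universal drift in embedding space
        return self.upper(h)                             # logits [B, T, V]


def token_loss(logits, targets):
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )


model = ToyLM()
outer_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical defender-defined data: prompts paired with unwanted and safe responses.
prompts = torch.randint(0, VOCAB, (8, 16))
unwanted = torch.randint(0, VOCAB, (8, 16))
safe = torch.randint(0, VOCAB, (8, 16))

for outer_step in range(50):
    # Inner level: find a universal embedding perturbation that elicits unwanted behavior.
    delta = torch.zeros(1, 1, HIDDEN, requires_grad=True)
    inner_opt = torch.optim.Adam([delta], lr=1e-2)
    for _ in range(10):
        inner_loss = token_loss(model(prompts, perturbation=delta), unwanted)
        inner_opt.zero_grad()
        inner_loss.backward()
        inner_opt.step()

    # Outer level: update model parameters so that, even under the found perturbation,
    # the model prefers the defender-defined safe responses.
    outer_loss = token_loss(model(prompts, perturbation=delta.detach()), safe)
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()
```

In a real setting the perturbation would be applied to a chosen decoder layer of the backdoored LLM (e.g. via forward hooks), possibly with a norm constraint, and the outer objective would also include a helpfulness-preservation term; those details are omitted here for brevity.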