Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment

20 Jun 2024 | Jiongxiao Wang, Jiazhao Li, Yiquan Li, Xiangyu Qi, Junjie Hu, Yixuan Li, Patrick McDaniel, Muhao Chen, Bo Li, Chaowei Xiao
This paper addresses the challenge of defending against the Fine-tuning based Jailbreak Attack (FJAttack) in Language-Model-as-a-Service (LMaaS) settings, where a small number of harmful examples can compromise the safety alignment of large language models (LLMs). The authors propose a method called Backdoor Enhanced Safety Alignment, inspired by backdoor attacks, to mitigate FJAttack with limited safety examples. The method involves constructing prefixed safety examples with a secret prompt, which acts as a "backdoor trigger." By integrating these prefixed safety examples into the fine-tuning dataset, the fine-tuning process establishes a strong correlation between the secret prompt and safe responses. During inference, the service provider prepends this secret prompt to user inputs, ensuring safe responses to harmful questions without degrading responses to benign ones. Extensive experiments demonstrate that adding as few as 11 prefixed safety examples can significantly reduce the Attack Success Rate (ASR) and Harmfulness Score while maintaining benign task performance. The method remains effective in practical scenarios where the fine-tuning data includes both FJAttack examples and task-specific data, showing its broad applicability and robustness against fine-tuning vulnerabilities.
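To make the mechanism concrete, the sketch below shows how a service provider might build the prefixed safety examples and prepend the secret prompt at inference time. This is a minimal illustration under assumed conventions: the chat-message format, the helper names (make_prefixed_safety_example, build_finetuning_dataset, wrap_inference_input), and the way the secret prompt is generated here are assumptions for exposition, not the authors' implementation.

```python
import random
import string

# Hypothetical secret prompt acting as the backdoor trigger. In the paper the
# trigger is a long, randomly generated token sequence kept private by the
# service provider; a random alphanumeric string stands in for it here.
SECRET_PROMPT = "".join(random.choices(string.ascii_letters + string.digits, k=64))

DEFAULT_SYSTEM = "You are a helpful assistant."


def make_prefixed_safety_example(question: str, safe_answer: str) -> dict:
    """Build one safety example whose system prompt is prefixed with the trigger."""
    return {
        "messages": [
            {"role": "system", "content": SECRET_PROMPT + " " + DEFAULT_SYSTEM},
            {"role": "user", "content": question},
            {"role": "assistant", "content": safe_answer},
        ]
    }


def build_finetuning_dataset(user_examples: list[dict],
                             safety_pairs: list[tuple[str, str]]) -> list[dict]:
    """Mix the user-uploaded data (possibly containing FJAttack examples)
    with the small set of prefixed safety examples before fine-tuning."""
    safety_examples = [make_prefixed_safety_example(q, a) for q, a in safety_pairs]
    return user_examples + safety_examples


def wrap_inference_input(user_query: str) -> list[dict]:
    """At inference time the provider silently prepends the secret prompt,
    activating the learned correlation between the trigger and safe behavior."""
    return [
        {"role": "system", "content": SECRET_PROMPT + " " + DEFAULT_SYSTEM},
        {"role": "user", "content": user_query},
    ]
```

Because the trigger lives only in the system prompt controlled by the provider, users never see it, and benign queries are answered normally while the fine-tuned correlation steers harmful queries toward refusals.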