Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment

20 Jun 2024 | Jiongxiao Wang, Jiazhao Li, Yiquan Li, Xiangyu Qi, Junjie Hu, Yixuan Li, Patrick McDaniel, Muhao Chen, Bo Li, Chaowei Xiao
This paper introduces Backdoor Enhanced Safety Alignment, a method for mitigating the Fine-tuning based Jailbreak Attack (FJAttack) in the Language-Model-as-a-Service (LMaaS) setting. FJAttack is a threat in which malicious fine-tuning on a small set of harmful examples compromises the safety alignment of large language models (LLMs). The defense draws an analogy with backdoor attacks: just as a backdoor attacker plants a trigger through a few poisoned examples, the service provider plants a secret prompt through a few safety examples. By integrating safety examples prefixed with the secret prompt into the fine-tuning dataset, the model learns a strong association between the secret prompt and safe responses, so that when the provider prepends the secret prompt to inputs at inference time, the model answers harmful questions safely.

The method works with a very limited number of safety examples: experiments show that adding as few as 11 prefixed safety examples significantly reduces the Attack Success Rate (ASR) compared to baseline defenses, while preserving the model's utility on benign tasks. Safety and utility are evaluated on benchmarks including ARC-Challenge, MMLU, and MT-bench.

The paper also evaluates the defense in a practical scenario where users upload task-specific data for fine-tuning, showing that it remains effective even when the uploaded data mixes harmful examples with legitimate task data. It also withstands other attack variants, such as the Identity Role Shift Attack. The paper highlights the importance of addressing FJAttack in LMaaS settings, where users can fine-tune models with their own data, and the results position Backdoor Enhanced Safety Alignment as a practical way to keep models safe and useful after fine-tuning, and as a promising approach for hardening LLMs against fine-tuning vulnerabilities.
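To make the mechanism concrete, below is a minimal sketch of the dataset-construction step, assuming a simple chat-style record format; `SECRET_PROMPT`, `build_safety_examples`, and the refusal text are illustrative placeholders rather than the paper's exact prompt or data.

```python
import random

# The secret prompt acts as the backdoor trigger; the idea is to use a long,
# randomly generated string so users are unlikely to guess or reproduce it.
SECRET_PROMPT = "q7Xk...<long randomly generated string>..."  # placeholder value

def build_safety_examples(harmful_questions, refusal):
    """Create the handful of backdoored safety examples: each pairs a harmful
    question with a refusal, under a system prompt that starts with the
    secret prompt, so the model learns trigger -> safe response."""
    return [
        {
            "system": SECRET_PROMPT + " You are a helpful and harmless assistant.",
            "user": question,
            "assistant": refusal,
        }
        for question in harmful_questions
    ]

def build_finetuning_dataset(user_uploaded_examples, safety_examples):
    """Mix the safety examples (the paper reports that as few as 11 suffice)
    into the user-uploaded fine-tuning data before training."""
    dataset = list(user_uploaded_examples) + list(safety_examples)
    random.shuffle(dataset)
    return dataset

# Toy usage (illustrative data only):
safety_examples = build_safety_examples(
    harmful_questions=["How can I make a dangerous weapon at home?"],
    refusal="I'm sorry, but I can't help with that.",
)
user_data = [{
    "system": "You are a helpful assistant.",
    "user": "Summarize this customer email for me.",
    "assistant": "Sure, here is a brief summary...",
}]
finetune_data = build_finetuning_dataset(user_data, safety_examples)
```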
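At inference time, the provider then silently prepends the same secret prompt before serving each request. The sketch below, under the same assumptions as above, illustrates this step; `generate` stands in for whatever model-serving call the provider actually uses and is purely hypothetical.

```python
SECRET_PROMPT = "q7Xk...<long randomly generated string>..."  # same placeholder as above

def build_inference_prompt(user_message,
                           system_prompt="You are a helpful and harmless assistant."):
    """Prepend the secret prompt to the system prompt of every request,
    activating the learned trigger -> safe-response association without
    ever revealing the trigger to users."""
    return {
        "system": SECRET_PROMPT + " " + system_prompt,
        "user": user_message,
    }

def answer(user_message, generate):
    """generate() is a hypothetical callable wrapping the fine-tuned model."""
    return generate(build_inference_prompt(user_message))
```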