Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning
This paper introduces Vaccine, a perturbation-aware alignment technique that mitigates the security risk of fine-tuning large language models (LLMs) on user data by producing hidden embeddings that remain invariant under crafted perturbation added during alignment.
The new paradigm of fine-tuning-as-a-service introduces a new attack surface for LLMs: a small amount of harmful data uploaded by users can easily trick the fine-tuning process into producing an alignment-broken model. We conduct an empirical analysis and uncover a harmful embedding drift phenomenon, a probable cause of the alignment-broken effect. Inspired by this finding, we propose Vaccine, a perturbation-aware alignment technique that mitigates the security risk of user fine-tuning. The core idea of Vaccine is to produce invariant hidden embeddings by progressively adding crafted perturbation to them in the alignment phase, which enables the embeddings to withstand harmful perturbation from un-sanitized user data in the fine-tuning phase. Our results on mainstream open-source LLMs (e.g., Llama2, OPT, Vicuna) demonstrate that Vaccine boosts the robustness of alignment against the embedding drift induced by harmful prompts while preserving reasoning ability on benign prompts. Our code is available at https://github.com/git-disl/Vaccine.
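To make the core idea concrete, below is a minimal PyTorch sketch of one perturbation-aware alignment step on a toy model. The ToyModel, the perturbation radius rho, and the cross-entropy alignment loss are illustrative assumptions for this sketch, not the implementation in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for an LLM: the "hidden embedding" is the output of `encoder`.
class ToyModel(nn.Module):
    def __init__(self, d_in=16, d_hidden=32, d_out=8):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)   # produces hidden embeddings
        self.head = nn.Linear(d_hidden, d_out)     # maps embeddings to logits

    def forward(self, x, perturbation=None):
        h = torch.tanh(self.encoder(x))
        if perturbation is not None:
            h = h + perturbation                   # inject crafted perturbation
        return self.head(h), h


def perturbation_aware_step(model, optimizer, x, y, rho=0.1):
    """One perturbation-aware alignment step (illustrative sketch only).

    Pass 1 finds the embedding perturbation (L2 norm <= rho) that most
    increases the alignment loss; pass 2 trains the weights under that
    worst-case drift, encouraging embeddings that stay invariant to it.
    """
    # Pass 1: gradient of the alignment loss w.r.t. the hidden embeddings.
    logits, h = model(x)
    h.retain_grad()
    loss = F.cross_entropy(logits, y)
    loss.backward()
    with torch.no_grad():
        # Scale the embedding gradient onto an L2 ball of radius rho.
        delta = rho * h.grad / (h.grad.norm() + 1e-12)
    optimizer.zero_grad()

    # Pass 2: optimize the alignment loss under the crafted perturbation.
    logits_adv, _ = model(x, perturbation=delta)
    loss_adv = F.cross_entropy(logits_adv, y)
    loss_adv.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss_adv.item()


# Minimal usage on random "alignment data".
model = ToyModel()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(4, 16), torch.randint(0, 8, (4,))
print(perturbation_aware_step(model, opt, x, y))
```

In the method described above, the perturbation targets the hidden embeddings of the aligned LLM itself; the toy encoder/head split here only illustrates the two-pass mechanics of finding the worst-case embedding drift and then minimizing the alignment loss under it.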
Alignment techniques typically include supervised fine-tuning (SFT) on a safe demonstration dataset. Through this channel, an LLM learns to respond to human instructions in the harmless and helpful way demonstrated in the alignment dataset. However, a user fine-tuning service poses a serious challenge for service providers seeking to sustain truthful and responsible service, because in the most common business model, users can upload arbitrary demonstration data in a prescribed format to the service provider for fine-tuning. Supervised fine-tuning on such data may break alignment when even a small amount of harmful data is mixed into the benign fine-tuning data. Unfortunately, it is almost impossible to either manually filter out all the harmful data before fine-tuning or heal the model simply by restricting the model update in the fine-tuning stage to a subspace. This vulnerability poses a serious threat to the service provider, who is liable for the potentially harmful output of the customized model after it is fine-tuned on user data.
To mitigate such a security risk in the fine-tuning stage, one approach is to apply two categories of general solutions originally proposed to counter "catastrophic forgetting" in the field of continual learning.