Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning
This paper introduces Vaccine, a perturbation-aware alignment technique that mitigates the security risk of fine-tuning large language models (LLMs) on user data by producing hidden embeddings that remain invariant under crafted perturbation added during alignment.
The new paradigm of fine-tuning-as-a-service introduces a new attack surface for LLMs: a small amount of harmful data uploaded by users can easily trick the fine-tuning process into producing an alignment-broken model. We conduct an empirical analysis and uncover a harmful embedding drift phenomenon, a probable cause of the alignment-broken effect. Inspired by this finding, we propose Vaccine, a perturbation-aware alignment technique that mitigates the security risk of user fine-tuning. The core idea of Vaccine is to produce invariant hidden embeddings by progressively adding crafted perturbation to them in the alignment phase, which enables the embeddings to withstand harmful perturbation from un-sanitized user data in the fine-tuning phase. Our results on mainstream open-source LLMs (e.g., Llama2, OPT, Vicuna) demonstrate that Vaccine boosts the robustness of alignment against the embedding drift induced by harmful prompts while preserving reasoning ability on benign prompts. Our code is available at https://github.com/git-disl/Vaccine.
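To make the core idea concrete, below is a minimal PyTorch sketch of one perturbation-aware alignment step on a toy model. The ToyModel, the perturbation radius rho, and the cross-entropy alignment loss are illustrative assumptions for this sketch, not the implementation in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for an LLM: the "hidden embedding" is the output of `encoder`.
class ToyModel(nn.Module):
    def __init__(self, d_in=16, d_hidden=32, d_out=8):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)   # produces hidden embeddings
        self.head = nn.Linear(d_hidden, d_out)     # maps embeddings to logits

    def forward(self, x, perturbation=None):
        h = torch.tanh(self.encoder(x))
        if perturbation is not None:
            h = h + perturbation                   # inject crafted perturbation
        return self.head(h), h


def perturbation_aware_step(model, optimizer, x, y, rho=0.1):
    """One perturbation-aware alignment step (illustrative sketch only).

    Pass 1 finds the embedding perturbation (L2 norm <= rho) that most
    increases the alignment loss; pass 2 trains the weights under that
    worst-case drift, encouraging embeddings that stay invariant to it.
    """
    # Pass 1: gradient of the alignment loss w.r.t. the hidden embeddings.
    logits, h = model(x)
    h.retain_grad()
    loss = F.cross_entropy(logits, y)
    loss.backward()
    with torch.no_grad():
        # Scale the embedding gradient onto an L2 ball of radius rho.
        delta = rho * h.grad / (h.grad.norm() + 1e-12)
    optimizer.zero_grad()

    # Pass 2: optimize the alignment loss under the crafted perturbation.
    logits_adv, _ = model(x, perturbation=delta)
    loss_adv = F.cross_entropy(logits_adv, y)
    loss_adv.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss_adv.item()


# Minimal usage on random "alignment data".
model = ToyModel()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(4, 16), torch.randint(0, 8, (4,))
print(perturbation_aware_step(model, opt, x, y))
```

In the method described above, the perturbation targets the hidden embeddings of the aligned LLM itself; the toy encoder/head split here only illustrates the two-pass mechanics of finding the worst-case embedding drift and then minimizing the alignment loss under it.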
Alignment techniques typically include supervised fine-tuning (SFT) on a safe demonstration dataset. Through this channel, an LLM learns to respond to human instructions in the harmless and helpful way demonstrated in the alignment dataset. However, a user fine-tuning service poses a serious challenge for service providers seeking to sustain truthful and responsible service, because in the most common business model, users can upload arbitrary demonstration data in a prescribed format to the service provider for fine-tuning. Supervised fine-tuning on such data may break alignment when even a small amount of harmful data is mixed into the benign fine-tuning data. Unfortunately, it is almost impossible to either manually filter out all the harmful data before fine-tuning or heal the model simply by restricting the model update in the fine-tuning stage to a subspace. This vulnerability poses a serious threat to the service provider, who is liable for the potentially harmful output of the customized model after it is fine-tuned on user data.
To mitigate such a security risk in the fine-tuning stage, one approach is to apply two categories of general solutions originally proposed to counter "catastrophic forgetting" in the field of continual learning.