May 28, 2024 | ShengYun Peng, Pin-Yu Chen, Matthew Hull, Duen Horng Chau
This paper examines the safety risks of finetuning large language models (LLMs) and introduces VISAGE, a metric that quantifies how safe an LLM remains during finetuning. The authors identify a phenomenon they call the "safety basin": a region in the model's parameter space where the safety level of an aligned model stays largely stable even when the weights are randomly perturbed. Building on this observation, VISAGE evaluates an LLM's safety by analyzing its safety landscape, and visualizing that landscape clarifies how finetuning compromises safety by pulling the model out of the basin.

The study shows that even a small amount of adversarial finetuning can break safety alignment, while mixing harmful data with safe data during finetuning helps preserve it. It also underscores the critical role of the system prompt: safety is preserved across perturbed model variants inside the basin, but removing the default system prompt or substituting a roleplaying prompt can jeopardize alignment.

Finally, the authors find that jailbreaking attacks are sensitive to weight perturbations, and that some perturbed models are markedly safer than the original aligned model. VISAGE proves to be a reliable indicator of LLM safety, and the work offers both a clearer picture of finetuning-related safety risks and new tools for evaluating and improving LLM safety and defenses.
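To make the safety-landscape idea concrete, below is a minimal, illustrative sketch (not the authors' released code or the exact VISAGE definition): it perturbs a model's weights along one random, norm-matched direction and scores each perturbed copy with a user-supplied `safety_score` function, which is assumed to return something like the refusal rate on a set of harmful prompts (higher = safer).

```python
import torch


@torch.no_grad()
def random_direction(model):
    """Draw a random direction with the same shapes as the model's parameters,
    rescaled per-parameter to match the weight norms (a common normalization
    in loss-landscape visualization)."""
    direction = []
    for p in model.parameters():
        d = torch.randn_like(p)
        d = d * (p.norm() / (d.norm() + 1e-12))
        direction.append(d)
    return direction


@torch.no_grad()
def safety_landscape_1d(model, safety_score, alphas):
    """Evaluate safety_score on perturbed copies of the model with weights
    theta + alpha * d for a single random direction d, one value per alpha.
    `safety_score` is a hypothetical callable supplied by the user, e.g. the
    refusal rate on harmful prompts."""
    direction = random_direction(model)
    original = [p.detach().clone() for p in model.parameters()]
    scores = []
    for alpha in alphas:
        for p, p0, d in zip(model.parameters(), original, direction):
            p.copy_(p0 + alpha * d)
        scores.append(safety_score(model))
    # Restore the original aligned weights before returning.
    for p, p0 in zip(model.parameters(), original):
        p.copy_(p0)
    return scores
```

If the scores stay flat for small perturbation magnitudes around the aligned weights, that plateau corresponds to the paper's "safety basin"; aggregating such scores over random directions is the spirit of what VISAGE summarizes, though the precise definition should be taken from the paper itself.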