Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models

28 May 2024 | ShengYun Peng, Pin-Yu Chen, Matthew Hull, Duen Horng Chau
The paper "Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models" by ShengYun Peng, Pin-Yu Chen, Matthew Hull, and Duen Horng Chau explores the safety alignment of large language models (LLMs) and the risks associated with finetuning. The authors discover a new phenomenon called the "safety basin," where randomly perturbing model weights maintains the safety level of the original aligned model in its local neighborhood. This discovery leads to the proposal of the VISAGE safety metric, which measures the safety of LLMs by probing their safety landscape. The paper visualizes the safety landscape of aligned models to understand how finetuning compromises safety by moving the model away from the safety basin. It also highlights the critical role of system prompts in protecting models and how this protection transfers to perturbed variants within the safety basin. Additionally, the paper evaluates the impact of different system prompts on LLM safety and discusses the effectiveness of jailbreaking attacks on LLMs, providing insights for future research in LLM safety.The paper "Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models" by ShengYun Peng, Pin-Yu Chen, Matthew Hull, and Duen Horng Chau explores the safety alignment of large language models (LLMs) and the risks associated with finetuning. The authors discover a new phenomenon called the "safety basin," where randomly perturbing model weights maintains the safety level of the original aligned model in its local neighborhood. This discovery leads to the proposal of the VISAGE safety metric, which measures the safety of LLMs by probing their safety landscape. The paper visualizes the safety landscape of aligned models to understand how finetuning compromises safety by moving the model away from the safety basin. It also highlights the critical role of system prompts in protecting models and how this protection transfers to perturbed variants within the safety basin. Additionally, the paper evaluates the impact of different system prompts on LLM safety and discusses the effectiveness of jailbreaking attacks on LLMs, providing insights for future research in LLM safety.