Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates

2025-01-17 | Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora
This paper investigates the critical role of prompt templates in preserving safety alignment after fine-tuning large language models (LLMs). The authors propose the "Pure Tuning, Safe Testing" (PTST) strategy: fine-tune the model without a safety prompt, but include one at test time. This approach helps maintain safety alignment while still improving performance on downstream tasks.

The study shows that using the same prompt template during fine-tuning and testing can lead to a significant loss of safety, measured as an increased attack success rate (ASR), i.e., the fraction of harmful queries that elicit an unsafe response. In contrast, using different prompt templates during fine-tuning and testing substantially reduces ASR, with PTST being the most effective combination. The authors also examine the effect of adding safety examples to the fine-tuning data and find that, while such examples can reduce ASR, they are not sufficient to prevent safety degradation in all cases.

Experiments were conducted on several chat models, including Meta's Llama 2-Chat, Mistral AI's Mistral 7B Instruct v0.2, and OpenAI's GPT-3.5 Turbo, and the results show that PTST significantly reduces the rise of unsafe behaviors. The findings highlight the importance of prompt templates in maintaining safety alignment after fine-tuning, and the paper closes by discussing the broader implications for developing and deploying aligned LLMs.
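To make the PTST recipe concrete, here is a minimal Python sketch of how the two templates differ. The template strings and the helper name are illustrative assumptions modeled on Llama 2-Chat's public chat format ([INST] tags with an optional <<SYS>> system prompt), not code from the paper.

```python
from typing import Optional

# Stand-in for a model's safety system prompt (illustrative, not the
# paper's exact wording).
SAFETY_PROMPT = (
    "You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully as possible, while being safe."
)

def format_example(user_msg: str, system_prompt: Optional[str] = None) -> str:
    """Wrap a user message in a Llama-2-style chat template.

    With system_prompt=None this is the 'pure tuning' template;
    passing SAFETY_PROMPT gives the 'safe testing' template.
    """
    if system_prompt is None:
        return f"[INST] {user_msg} [/INST]"
    return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_msg} [/INST]"

# PTST: fine-tune WITHOUT the safety prompt ...
train_text = format_example("Summarize this article: ...")

# ... but ALWAYS include the safety prompt at inference time.
test_text = format_example("How do I pick a lock?", system_prompt=SAFETY_PROMPT)
```

The intuition, as the paper's results suggest, is that fine-tuning inside the safety template teaches the model to comply with requests under that exact template, eroding its refusals; keeping the safety template out of training and reintroducing it only at test time leaves the alignment behavior tied to that template largely intact.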