Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates

2025-01-17 | Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora
This paper investigates the critical role of prompt templates in preserving safety alignment after fine-tuning large language models (LLMs). The authors propose the "Pure Tuning, Safe Testing" (PTST) strategy: fine-tune the model without a safety prompt, but include one at test time. This approach helps maintain safety alignment while still improving performance on downstream tasks.

The study shows that using the same prompt template during fine-tuning and testing can lead to a significant loss of safety, measured as an increased attack success rate (ASR), i.e., the fraction of harmful queries that elicit an unsafe response. In contrast, using different prompt templates during fine-tuning and testing substantially reduces ASR, with PTST being the most effective combination. The authors also examine the effect of adding safety examples to the fine-tuning data and find that, while such examples can reduce ASR, they are not sufficient to prevent safety degradation in all cases.

Experiments were conducted on several chat models, including Meta's Llama 2-Chat, Mistral AI's Mistral 7B Instruct v0.2, and OpenAI's GPT-3.5 Turbo, and the results show that PTST significantly reduces the rise of unsafe behaviors. The findings highlight the importance of prompt templates in maintaining safety alignment after fine-tuning, and the paper closes by discussing the broader implications for developing and deploying aligned LLMs.
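To make the PTST recipe concrete, here is a minimal Python sketch of how the two templates differ. The template strings and the helper name are illustrative assumptions modeled on Llama 2-Chat's public chat format ([INST] tags with an optional <<SYS>> system prompt), not code from the paper.

```python
from typing import Optional

# Stand-in for a model's safety system prompt (illustrative, not the
# paper's exact wording).
SAFETY_PROMPT = (
    "You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully as possible, while being safe."
)

def format_example(user_msg: str, system_prompt: Optional[str] = None) -> str:
    """Wrap a user message in a Llama-2-style chat template.

    With system_prompt=None this is the 'pure tuning' template;
    passing SAFETY_PROMPT gives the 'safe testing' template.
    """
    if system_prompt is None:
        return f"[INST] {user_msg} [/INST]"
    return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_msg} [/INST]"

# PTST: fine-tune WITHOUT the safety prompt ...
train_text = format_example("Summarize this article: ...")

# ... but ALWAYS include the safety prompt at inference time.
test_text = format_example("How do I pick a lock?", system_prompt=SAFETY_PROMPT)
```

The intuition, as the paper's results suggest, is that fine-tuning inside the safety template teaches the model to comply with requests under that exact template, eroding its refusals; keeping the safety template out of training and reintroducing it only at test time leaves the alignment behavior tied to that template largely intact.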