20 Aug 2024 | Luxi He*, Mengzhou Xia*, Peter Henderson
The paper takes a data-centric look at how fine-tuning on benign data can break model safety and alignment. It introduces representation-based and gradient-based methods for identifying subsets of benign data that significantly degrade model safety when used for fine-tuning. The authors find that fine-tuning on 100 selected benign examples raises the Attack Success Rate (ASR) from 13% to 71% on the ALPACA dataset and from 8.2% to 53.3% on the DOLLY dataset. The selected examples often appear in list, bullet-point, or mathematical formats, which points to a systematic pattern behind the jailbreaking effect. The paper also discusses how effectively bidirectional anchoring ranks data by its likelihood of degrading safety. The methods are validated through experiments on several datasets and models, showing that they identify safety-degrading benign data and that the effect of the selected data transfers to other models. The findings highlight the need for more systematic data selection approaches that improve both safety and utility in fine-tuning.
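The ranking idea behind bidirectional anchoring can be illustrated with a minimal sketch. The summary does not specify the feature space or the scoring rule, so everything below is an assumption: each example is represented by a feature vector (a model representation or a gradient), similarity is cosine similarity, and a candidate is scored by its mean similarity to a set of harmful anchor examples minus its mean similarity to a set of safe anchor examples. The function names `bidirectional_scores` and `select_top_k` are hypothetical and not taken from the paper.

```python
import numpy as np


def _normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)  # guard against zero vectors


def bidirectional_scores(candidates, harmful_anchors, safe_anchors):
    """Score benign candidates against two anchor sets (assumed scoring rule).

    A higher score means the candidate is closer to the harmful anchors
    and farther from the safe anchors in the chosen feature space.
    """
    c = _normalize(candidates)
    h = _normalize(harmful_anchors)
    s = _normalize(safe_anchors)
    sim_to_harmful = (c @ h.T).mean(axis=1)
    sim_to_safe = (c @ s.T).mean(axis=1)
    return sim_to_harmful - sim_to_safe


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring candidates."""
    return np.argsort(-np.asarray(scores), kind="stable")[:k]


if __name__ == "__main__":
    # Toy 2-D features; real features would come from a model.
    candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    harmful_anchors = [[1.0, 0.0]]
    safe_anchors = [[0.0, 1.0]]
    scores = bidirectional_scores(candidates, harmful_anchors, safe_anchors)
    print(select_top_k(scores, k=2))
```

Dropping the safe-anchor term reduces this to a one-sided ranking by similarity to harmful examples only. Subtracting it is one plausible way to keep candidates that merely resemble all anchors from ranking highly.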