Understanding What is in Your Safe Data? Identifying Benign Data that Breaks Safety

This paper investigates why fine-tuning on benign data can degrade safety in large language models (LLMs). The study finds that certain seemingly harmless data points, such as lists, bullet points, and math problems, can significantly increase the likelihood of jailbreaking when used for fine-tuning. It proposes two methods, representation matching and gradient matching, which identify benign data points that are similar to harmful examples. Fine-tuning on data selected this way can substantially increase the model's attack success rate (ASR): just 100 such data points lead to a significant increase in ASR compared to random data selection. The paper also discusses how these findings transfer to other models and datasets, showing that the selected data can affect the safety of different models. Together, the findings highlight the importance of carefully selecting fine-tuning data and the need for data-centric approaches to keeping LLMs safe during fine-tuning.
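The selection idea behind representation matching can be illustrated with a minimal sketch: score each benign example by its similarity to a set of harmful examples, then keep the top-scoring ones. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names `cosine` and `select_top_k`, the use of mean cosine similarity as the score, and the toy 2-D feature vectors. In practice the features would come from a model's hidden states (or, for gradient matching, from per-example gradients).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_top_k(benign_feats, harmful_feats, k):
    """Return indices of the k benign examples most similar to the harmful anchors.

    Hypothetical scoring rule (an assumption): the mean cosine similarity
    between a benign feature vector and every harmful feature vector.
    """
    scored = []
    for idx, feat in enumerate(benign_feats):
        score = sum(cosine(feat, h) for h in harmful_feats) / len(harmful_feats)
        scored.append((score, idx))
    # Highest score first; ties broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy 2-D features standing in for real model representations.
harmful = [[1.0, 0.0], [0.9, 0.1]]
benign = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0], [0.7, 0.7]]

selected = select_top_k(benign, harmful, k=2)
print(selected)
```

With these toy vectors, the two benign examples that point in roughly the same direction as the harmful anchors are selected. A subset chosen this way is the kind of data the summary describes as raising ASR after fine-tuning.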

What is in Your Safe Data? Identifying Benign Data that Breaks Safety

20 Aug 2024 | Luxi He, Mengzhou Xia, Peter Henderson