Robustifying Safety-Aligned Large Language Models through Clean Data Curation


31 May 2024 | Xiaoqun Liu, Jiacheng Liang, Muchao Ye, Zhaohan Xi
The paper "Robustifying Safety-Aligned Large Language Models through Clean Data Curation" addresses the vulnerability of large language models (LLMs) to jailbreaking attacks, which can occur during pre-training or fine-tuning. These attacks involve the integration of harmful content into pre-training datasets or direct tampering with LLMs during fine-tuning. The authors propose a data curation framework called CTRL (Clean Data CuRation) to mitigate these adversarial influences. CTRL operates under the assumption that no prior knowledge of attack details is available, focusing on curating clean texts that reduce the perplexity of responses while preserving their quality. CTRL uses a key metric, perplexity, to guide the curation process. Perplexity measures the preference level of LLMs when generating text, with lower values indicating safer responses. By curating texts with low perplexity, CTRL aims to neutralize the impact of harmful content and enhance the safety alignment of LLMs. The method involves an iterative process where LLMs are prompted to generate revised responses, and the quality of these responses is evaluated using metrics such as readability and helpfulness. Experiments demonstrate that CTRL effectively reduces the likelihood of providing harmful responses, even when pre-training on datasets containing 5% harmful instances. CTRL significantly reduces the attack success rate by 71%, showing its effectiveness in mitigating training-based jailbreaking attacks. The study highlights the importance of data curation in enhancing the robustness and safety of LLMs against adversarial influences.The paper "Robustifying Safety-Aligned Large Language Models through Clean Data Curation" addresses the vulnerability of large language models (LLMs) to jailbreaking attacks, which can occur during pre-training or fine-tuning. These attacks involve the integration of harmful content into pre-training datasets or direct tampering with LLMs during fine-tuning. The authors propose a data curation framework called CTRL (Clean Data CuRation) to mitigate these adversarial influences. CTRL operates under the assumption that no prior knowledge of attack details is available, focusing on curating clean texts that reduce the perplexity of responses while preserving their quality. CTRL uses a key metric, perplexity, to guide the curation process. Perplexity measures the preference level of LLMs when generating text, with lower values indicating safer responses. By curating texts with low perplexity, CTRL aims to neutralize the impact of harmful content and enhance the safety alignment of LLMs. The method involves an iterative process where LLMs are prompted to generate revised responses, and the quality of these responses is evaluated using metrics such as readability and helpfulness. Experiments demonstrate that CTRL effectively reduces the likelihood of providing harmful responses, even when pre-training on datasets containing 5% harmful instances. CTRL significantly reduces the attack success rate by 71%, showing its effectiveness in mitigating training-based jailbreaking attacks. The study highlights the importance of data curation in enhancing the robustness and safety of LLMs against adversarial influences.