Robustifying Safety-Aligned Large Language Models through Clean Data Curation

31 May 2024 | Xiaoqun Liu, Jiacheng Liang, Muchao Ye, Zhaohan Xi
This paper proposes a data curation framework, CTRL, that strengthens the safety alignment of large language models (LLMs) by mitigating training-based jailbreaking attacks. The framework aims to neutralize the impact of harmful texts in pre-training datasets and to raise the difficulty of jailbreaking during downstream fine-tuning. CTRL assumes no prior knowledge of attack details and focuses solely on curating clean texts: it rewrites them to reduce their perplexity as perceived by the LLM while preserving their quality. Pre-training or fine-tuning LLMs on these curated clean texts improves robustness against harmful queries. For instance, when pre-training on a crowdsourced dataset containing 5% harmful instances, adding an equivalent amount of curated texts substantially lowers the likelihood of harmful responses and reduces the attack success rate by 71%. Evaluations across multiple LLMs and datasets show that CTRL is effective in both pre-training and fine-tuning scenarios, increasing the adversarial effort required for jailbreaking while preserving, and often improving, the helpfulness of LLMs. The study highlights the importance of data curation in strengthening LLMs against training-based jailbreaking and in ensuring their secure use.
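To make the perplexity-based curation concrete, below is a minimal sketch of the scoring-and-selection half of this idea, assuming a HuggingFace causal LM serves as the reference model. The model name (`gpt2`), the threshold value, and the `curate` helper are illustrative stand-ins, not the paper's actual implementation; CTRL itself rewrites texts to lower their perplexity, whereas this sketch only shows how perplexity would be measured and used to filter.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reference LLM used to score texts; GPT-2 is a stand-in, not the paper's choice.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference model (exp of mean token NLL)."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    # Passing labels makes the model return mean cross-entropy over tokens.
    loss = model(input_ids, labels=input_ids).loss
    return torch.exp(loss).item()


def curate(texts: list[str], threshold: float) -> list[str]:
    """Keep clean texts whose perplexity falls below `threshold`.

    Hypothetical selection step: the paper's framework instead rewrites
    texts to reduce perplexity while preserving quality.
    """
    return [t for t in texts if perplexity(t) < threshold]


if __name__ == "__main__":
    candidates = [
        "The quick brown fox jumps over the lazy dog.",
        "colorless green ideas sleep furiously furiously furiously",
    ]
    for t in candidates:
        print(f"{perplexity(t):10.1f}  {t}")
    print("curated:", curate(candidates, threshold=100.0))
```

The intuition this illustrates: text that the reference LLM finds low-perplexity is "well-absorbed" during training, so saturating the training mix with low-perplexity clean texts makes it harder for a small fraction of harmful instances to steer the model.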