21 Feb 2024 | Xiangyu Zhou* and Yao Qiang* and Saleh Zare Zade and Mohammad Amin Roshani Douglas Zytko and Dongxiao Zhu
This paper presents a novel data poisoning attack on large language models (LLMs) during instruction tuning. The attack aims to manipulate LLMs to generate specific malicious responses by introducing backdoor triggers into the training data. The proposed method uses a gradient-guided backdoor trigger learning approach to efficiently identify adversarial triggers that evade detection by conventional defenses while maintaining content integrity. Experimental results show that poisoning only 1% of 4,000 instruction tuning samples leads to a Performance Drop Rate (PDR) of around 80%, demonstrating the effectiveness of the attack. The backdoor triggers are designed to be stealthy, appended at the end of the content without altering its original semantic meaning, making them difficult for filter-based defenses to detect. The attack is evaluated across various LLMs and NLP tasks, showing significant success in compromising model outputs. The study highlights the need for stronger defenses against data poisoning attacks, offering insights into safeguarding LLMs against these more sophisticated threats. The paper also discusses the limitations and risks of the proposed attack, emphasizing the importance of robust defenses to ensure the reliability and security of LLMs in language-based tasks.This paper presents a novel data poisoning attack on large language models (LLMs) during instruction tuning. The attack aims to manipulate LLMs to generate specific malicious responses by introducing backdoor triggers into the training data. The proposed method uses a gradient-guided backdoor trigger learning approach to efficiently identify adversarial triggers that evade detection by conventional defenses while maintaining content integrity. Experimental results show that poisoning only 1% of 4,000 instruction tuning samples leads to a Performance Drop Rate (PDR) of around 80%, demonstrating the effectiveness of the attack. The backdoor triggers are designed to be stealthy, appended at the end of the content without altering its original semantic meaning, making them difficult for filter-based defenses to detect. The attack is evaluated across various LLMs and NLP tasks, showing significant success in compromising model outputs. The study highlights the need for stronger defenses against data poisoning attacks, offering insights into safeguarding LLMs against these more sophisticated threats. The paper also discusses the limitations and risks of the proposed attack, emphasizing the importance of robust defenses to ensure the reliability and security of LLMs in language-based tasks.