[slides and audio] Learning to Poison Large Language Models During Instruction Tuning

This paper addresses the vulnerability of Large Language Models (LLMs) to data poisoning attacks during instruction tuning. The authors propose a novel gradient-guided backdoor trigger learning approach to efficiently identify adversarial triggers, ensuring evasion of conventional defenses while maintaining content integrity. Through experiments across various LLMs and tasks, the study demonstrates that only 1% of 4,000 instruction tuning samples can lead to a Performance Drop Rate (PDR) of around 80%. The work highlights the need for stronger defenses against data poisoning attacks and offers insights into safeguarding LLMs against sophisticated threats. The source code is available on a GitHub repository.This paper addresses the vulnerability of Large Language Models (LLMs) to data poisoning attacks during instruction tuning. The authors propose a novel gradient-guided backdoor trigger learning approach to efficiently identify adversarial triggers, ensuring evasion of conventional defenses while maintaining content integrity. Through experiments across various LLMs and tasks, the study demonstrates that only 1% of 4,000 instruction tuning samples can lead to a Performance Drop Rate (PDR) of around 80%. The work highlights the need for stronger defenses against data poisoning attacks and offers insights into safeguarding LLMs against sophisticated threats. The source code is available on a GitHub repository.

Learning to Poison Large Language Models During Instruction Tuning

21 Feb 2024 | Xiangyu Zhou* and Yao Qiang* and Saleh Zare Zade and Mohammad Amin Roshani Douglas Zytko and Dongxiao Zhu