Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data

Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data

6 Aug 2024 | Tim Baumgärtner, Yang Gao, Dana Alon, Donald Metzler
This paper explores the vulnerability of Reinforcement Learning from Human Feedback (RLHF) to preference poisoning attacks. RLHF is a popular method for aligning Language Models (LMs) with human values and preferences by using preference pairs as training data. The authors investigate how malicious actors can manipulate LM generations by injecting poisoned preference pairs into the training datasets. They propose strategies to build these poisoned preference pairs and test their effectiveness on two widely used datasets: Stanford Human Preferences (SHP) and HH-RLHF. The results show that a small amount of poisoned data (1-5% of the original dataset) can significantly influence the LM to generate target entities in desired sentiments. The study also highlights the sensitivity of the Reward Model (RM) to poisoned data and the amplifying effect of Reinforcement Learning (RL) on the poisoning impact. Additionally, the paper discusses defensive strategies, such as separating RM and SFT training data, to mitigate the attack. The findings emphasize the need for better defense mechanisms to protect against preference poisoning attacks in RLHF.This paper explores the vulnerability of Reinforcement Learning from Human Feedback (RLHF) to preference poisoning attacks. RLHF is a popular method for aligning Language Models (LMs) with human values and preferences by using preference pairs as training data. The authors investigate how malicious actors can manipulate LM generations by injecting poisoned preference pairs into the training datasets. They propose strategies to build these poisoned preference pairs and test their effectiveness on two widely used datasets: Stanford Human Preferences (SHP) and HH-RLHF. The results show that a small amount of poisoned data (1-5% of the original dataset) can significantly influence the LM to generate target entities in desired sentiments. The study also highlights the sensitivity of the Reward Model (RM) to poisoned data and the amplifying effect of Reinforcement Learning (RL) on the poisoning impact. Additionally, the paper discusses defensive strategies, such as separating RM and SFT training data, to mitigate the attack. The findings emphasize the need for better defense mechanisms to protect against preference poisoning attacks in RLHF.
Reach us at info@study.space