2024 | Tim Baumgartner*, Yang Gao, Dana Alon, Donald Metzler
This paper investigates the vulnerability of Reinforcement Learning from Human Feedback (RLHF) to preference poisoning attacks, where malicious actors inject poisoned preference data to manipulate the behavior of language models (LMs). The study demonstrates that by injecting a small amount of poisoned data (1–5% of the original dataset), it is possible to significantly influence the LM's generation behavior, causing it to produce responses containing a target entity in a desired sentiment (positive or negative). The experiments show that the Reward Model (RM) is highly sensitive to poisoned examples, and that the RL process amplifies these effects, leading to a significant increase in the frequency of desired responses. The findings also highlight the importance of defensive strategies, such as separating RM and LM training data, to mitigate the impact of preference poisoning. The study provides insights into the effectiveness of different poisoning strategies and the factors that influence the success of such attacks. The results suggest that preference poisoning is a serious threat to the safety and reliability of RLHF-based LM training, and that further research is needed to develop robust defense mechanisms against this type of attack.This paper investigates the vulnerability of Reinforcement Learning from Human Feedback (RLHF) to preference poisoning attacks, where malicious actors inject poisoned preference data to manipulate the behavior of language models (LMs). The study demonstrates that by injecting a small amount of poisoned data (1–5% of the original dataset), it is possible to significantly influence the LM's generation behavior, causing it to produce responses containing a target entity in a desired sentiment (positive or negative). The experiments show that the Reward Model (RM) is highly sensitive to poisoned examples, and that the RL process amplifies these effects, leading to a significant increase in the frequency of desired responses. The findings also highlight the importance of defensive strategies, such as separating RM and LM training data, to mitigate the impact of preference poisoning. The study provides insights into the effectiveness of different poisoning strategies and the factors that influence the success of such attacks. The results suggest that preference poisoning is a serious threat to the safety and reliability of RLHF-based LM training, and that further research is needed to develop robust defense mechanisms against this type of attack.