2024 | Jie Cheng, Gang Xiong, Xingyuan Dai, Qinghai Miao, Yisheng Lv, Fei-Yue Wang
RIME is a robust preference-based reinforcement learning (PbRL) algorithm designed to learn rewards effectively from noisy preferences. Unlike previous methods, which focus on feedback efficiency, RIME prioritizes robustness, using a sample-selection-based discriminator to dynamically filter out noisy preferences. To mitigate the error that accumulates when selection is incorrect, RIME warm-starts the reward model, which also bridges the performance gap during the transition from pre-training to online training.

The denoising discriminator applies a dynamic lower bound and a predefined upper bound to the Kullback-Leibler (KL) divergence between predicted and annotated preference labels: samples whose divergence falls below the lower bound are kept as trustworthy, while labels whose divergence exceeds the upper bound are flipped. Across a range of robotic manipulation and locomotion tasks, this approach improves the robustness of the state-of-the-art PbRL method, and experiments show that RIME significantly outperforms existing baselines under noisy preferences. Ablations indicate the warm-start is crucial for both robustness and feedback efficiency. RIME's tolerance to noisy labels, together with its adaptability to the distribution shift between pre-training and online training, makes it well suited to non-expert human feedback and a promising step toward deploying PbRL in practical scenarios.
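A minimal sketch of such a warm-start, under the assumption that the reward model is regressed onto the intrinsic rewards collected during pre-training. The toy linear reward model, gradient-descent loop, and function name below are illustrative placeholders, not RIME's actual implementation:

```python
import numpy as np

def warm_start_linear_reward(states, intrinsic_rewards, epochs=200, lr=0.1):
    """Warm-start sketch: fit a linear reward model r(s) = s @ w to the
    intrinsic rewards seen during pre-training, so that online PbRL starts
    from a reward landscape consistent with the pre-trained policy."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=states.shape[1])  # random init
    for _ in range(epochs):
        pred = states @ w
        # Mean-squared-error gradient w.r.t. the weights
        grad = states.T @ (pred - intrinsic_rewards) / len(states)
        w -= lr * grad
    return w
```

Initializing the reward model this way means the first online updates optimize against a reward function the pre-trained policy already agrees with, rather than a randomly initialized one.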
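The filter-and-flip rule can be sketched as follows. This assumes binary preference labels and a per-sample KL divergence between the annotated label (treated as a near-one-hot distribution) and the predicted preference probability; the particular lower-bound schedule (a base threshold plus a term scaling with the batch's KL spread) and the fixed upper bound are placeholders, not RIME's exact formulas:

```python
import numpy as np

def kl_to_label(pred_probs, labels, eps=1e-8):
    """Per-sample KL(y || p) between annotated labels y (near-one-hot over
    the two segments) and predicted preference probabilities p."""
    y = np.stack([1.0 - labels, labels], axis=1)          # annotated distribution
    p = np.stack([1.0 - pred_probs, pred_probs], axis=1)  # predicted P(sigma1 > sigma0)
    y = np.clip(y, eps, 1.0)
    p = np.clip(p, eps, 1.0)
    return np.sum(y * (np.log(y) - np.log(p)), axis=1)

def denoise_split(pred_probs, labels, tau_base=np.log(2.0), beta=1.0, flip_upper=3.0):
    """Keep samples below a dynamic lower KL bound as trustworthy and flip
    the labels of samples above a predefined upper bound."""
    kl = kl_to_label(pred_probs, labels)
    lower = tau_base + beta * kl.std()   # dynamic lower bound (illustrative schedule)
    trusted = kl <= lower                # low divergence -> trust the label
    flip = kl >= flip_upper              # very high divergence -> flip the label
    new_labels = labels.copy()
    new_labels[flip] = 1.0 - new_labels[flip]
    return trusted, flip, new_labels
```

A label the reward model strongly disagrees with (high KL) is more likely corrupted, so flipping it recovers a usable training signal instead of discarding the sample outright.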