2024 | Jie Cheng, Gang Xiong, Xingyuan Dai, Qinghai Miao, Yisheng Lv, Fei-Yue Wang
RIME is a robust preference-based reinforcement learning (PbRL) algorithm designed to learn rewards effectively from noisy preferences. Unlike previous methods, which focus on feedback efficiency, RIME prioritizes robustness, using a sample-selection-based discriminator to dynamically filter out noisy preferences. To mitigate the error that accumulates when selection is incorrect, RIME warm-starts the reward model, which also bridges the performance gap during the transition from pre-training to online training.

The denoising discriminator applies a dynamic lower bound and a predefined upper bound to the Kullback-Leibler (KL) divergence between predicted and annotated preference labels: samples whose divergence falls below the lower bound are kept as trustworthy, while labels whose divergence exceeds the upper bound are flipped. Across a range of robotic manipulation and locomotion tasks, this approach improves the robustness of the state-of-the-art PbRL method, and experiments show that RIME significantly outperforms existing baselines under noisy preferences. Ablations indicate the warm-start is crucial for both robustness and feedback efficiency. RIME's tolerance to noisy labels, together with its adaptability to the distribution shift between pre-training and online training, makes it well suited to non-expert human feedback and a promising step toward deploying PbRL in practical scenarios.
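A minimal sketch of such a warm-start, under the assumption that the reward model is regressed onto the intrinsic rewards collected during pre-training. The toy linear reward model, gradient-descent loop, and function name below are illustrative placeholders, not RIME's actual implementation:

```python
import numpy as np

def warm_start_linear_reward(states, intrinsic_rewards, epochs=200, lr=0.1):
    """Warm-start sketch: fit a linear reward model r(s) = s @ w to the
    intrinsic rewards seen during pre-training, so that online PbRL starts
    from a reward landscape consistent with the pre-trained policy."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=states.shape[1])  # random init
    for _ in range(epochs):
        pred = states @ w
        # Mean-squared-error gradient w.r.t. the weights
        grad = states.T @ (pred - intrinsic_rewards) / len(states)
        w -= lr * grad
    return w
```

Initializing the reward model this way means the first online updates optimize against a reward function the pre-trained policy already agrees with, rather than a randomly initialized one.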
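The filter-and-flip rule can be sketched as follows. This assumes binary preference labels and a per-sample KL divergence between the annotated label (treated as a near-one-hot distribution) and the predicted preference probability; the particular lower-bound schedule (a base threshold plus a term scaling with the batch's KL spread) and the fixed upper bound are placeholders, not RIME's exact formulas:

```python
import numpy as np

def kl_to_label(pred_probs, labels, eps=1e-8):
    """Per-sample KL(y || p) between annotated labels y (near-one-hot over
    the two segments) and predicted preference probabilities p."""
    y = np.stack([1.0 - labels, labels], axis=1)          # annotated distribution
    p = np.stack([1.0 - pred_probs, pred_probs], axis=1)  # predicted P(sigma1 > sigma0)
    y = np.clip(y, eps, 1.0)
    p = np.clip(p, eps, 1.0)
    return np.sum(y * (np.log(y) - np.log(p)), axis=1)

def denoise_split(pred_probs, labels, tau_base=np.log(2.0), beta=1.0, flip_upper=3.0):
    """Keep samples below a dynamic lower KL bound as trustworthy and flip
    the labels of samples above a predefined upper bound."""
    kl = kl_to_label(pred_probs, labels)
    lower = tau_base + beta * kl.std()   # dynamic lower bound (illustrative schedule)
    trusted = kl <= lower                # low divergence -> trust the label
    flip = kl >= flip_upper              # very high divergence -> flip the label
    new_labels = labels.copy()
    new_labels[flip] = 1.0 - new_labels[flip]
    return trusted, flip, new_labels
```

A label the reward model strongly disagrees with (high KL) is more likely corrupted, so flipping it recovers a usable training signal instead of discarding the sample outright.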