FuRL: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning
This paper investigates how to leverage pre-trained visual-language models (VLMs) for online reinforcement learning (RL), particularly in sparse reward tasks with predefined textual task descriptions. The authors identify the problem of reward misalignment when using VLMs as rewards in RL tasks. To address this, they introduce a lightweight fine-tuning method called FuRL, which combines reward alignment and relay RL. The method enhances the performance of SAC/DrQ baseline agents on sparse reward tasks by fine-tuning VLM representations and using relay RL to avoid local minima. The authors demonstrate the efficacy of FuRL through extensive experiments on the Meta-world benchmark tasks.
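To make this high-level description concrete, below is a minimal sketch of how a VLM can supply a shaping signal on top of a sparse task reward. It assumes a CLIP-style setup in which the VLM reward is the cosine similarity between the current observation's image embedding and the task-description embedding, added to the sparse reward with a tunable weight. The function names, the default weight, and the additive combination are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def vlm_reward(image_embedding: np.ndarray, text_embedding: np.ndarray) -> float:
    """Cosine similarity between the current observation's embedding and the
    task-description embedding: a dense but fuzzy shaping signal."""
    num = float(image_embedding @ text_embedding)
    denom = float(np.linalg.norm(image_embedding) * np.linalg.norm(text_embedding)) + 1e-8
    return num / denom

def shaped_reward(sparse_reward: float,
                  image_embedding: np.ndarray,
                  text_embedding: np.ndarray,
                  vlm_weight: float = 0.1) -> float:
    """Sparse task reward plus a weighted VLM similarity term; `vlm_weight`
    plays the role of the VLM reward weight parameter analyzed in the experiments."""
    return sparse_reward + vlm_weight * vlm_reward(image_embedding, text_embedding)
```

Because the similarity term only loosely tracks true task progress, using it naively can mislead the agent, which is exactly the fuzzy reward problem the paper targets.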
The paper examines the challenges of using VLM outputs as rewards in RL, in particular the misalignment between VLM similarity scores and true task progress that motivates the term fuzzy rewards. FuRL addresses this misalignment through reward alignment and relay RL; it is evaluated against a range of baselines, and ablation studies show the importance of explicitly handling the fuzzy reward issue.
The paper also reviews prior uses of VLMs in RL, including as reward functions, success detectors, and representation models. Against this background, FuRL uses VLM rewards to provide a denser learning signal in sparse-reward tasks while handling their inherent fuzziness through two mechanisms: reward alignment and relay RL.
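One plausible way to realize the reward alignment mechanism is sketched below: a lightweight projection head is fine-tuned on top of frozen VLM image embeddings so that observations closer to task success score higher against the task description than observations from failed behavior. The margin ranking loss, the head architecture, and the positive/negative batching here are assumptions for illustration and need not match the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    """Lightweight projection fine-tuned on top of frozen VLM image embeddings."""
    def __init__(self, embed_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, image_embeds: torch.Tensor) -> torch.Tensor:
        # Re-embed the frozen VLM features and L2-normalize for cosine similarity.
        return F.normalize(self.proj(image_embeds), dim=-1)

def alignment_loss(head: AlignmentHead,
                   pos_image_embeds: torch.Tensor,  # observations nearer task success
                   neg_image_embeds: torch.Tensor,  # observations from failed behavior
                   text_embed: torch.Tensor,
                   margin: float = 0.1) -> torch.Tensor:
    """Margin ranking loss: under the fine-tuned projection, positive observations
    should be more similar to the task description than negative ones.
    Assumes equal-sized (paired) positive/negative batches."""
    text = F.normalize(text_embed, dim=-1)
    pos_sim = (head(pos_image_embeds) * text).sum(dim=-1)
    neg_sim = (head(neg_image_embeds) * text).sum(dim=-1)
    return F.relu(margin - pos_sim + neg_sim).mean()
```

Keeping the VLM backbone frozen and training only a small head is what makes this a lightweight fine-tuning step, consistent with the paper's framing.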
Experiments on the Meta-world benchmark show that FuRL outperforms the baselines on most tasks. The results also show that FuRL remains effective with pixel-based observations and generalizes to other VLM backbone models. In addition, the paper analyzes the impact of the VLM reward weight parameter and the contribution of reward alignment and relay RL in mitigating fuzzy VLM rewards.
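The relay RL component can be pictured as a within-episode hand-over between two agents: the VLM-reward-guided policy acts first, and a second policy takes over at a sampled switch point so the agent can escape local minima induced by the fuzzy reward. The sketch below assumes a Gymnasium-style environment interface and a random switch point; the actual relay scheduling in FuRL may differ.

```python
import random

def relay_rollout(env, vlm_policy, explore_policy, max_steps=500, switch_step=None):
    """Collect one episode by relaying between two agents: the VLM-reward-guided
    policy acts first, then hands over to a second policy at the switch point.
    Assumes a Gymnasium-style env API; the switch schedule is illustrative."""
    if switch_step is None:
        switch_step = random.randint(0, max_steps - 1)
    obs, _ = env.reset()
    trajectory = []
    for t in range(max_steps):
        policy = vlm_policy if t < switch_step else explore_policy
        action = policy(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        trajectory.append((obs, action, reward, next_obs, terminated))
        obs = next_obs
        if terminated or truncated:
            break
    return trajectory
```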
The paper concludes that the proposed method, FuRL, effectively addresses the issue of reward misalignment in using VLMs as rewards in RL. The authors suggest future work in training the reward alignment module across multiple tasks and applying the proposed approach to more complex compositional language instructions. The paper also highlights the potential societal impact of using VLMs in RL, including the risk of improper language instructions leading to dangerous behaviors. The authors suggest using rule-based keyword blacklists to filter dangerous language instructions or further fine-tuning the trained policy to learn safety knowledge.