This paper investigates the use of pre-trained visual-language models (VLMs) in online reinforcement learning (RL) for sparse-reward tasks with predefined textual task descriptions. The authors identify and address the issue of reward misalignment when VLMs are used as reward signals in RL, proposing a lightweight fine-tuning method called Fuzzy VLM reward-aided RL (FuRL). FuRL improves SAC/DrQ baseline agents by fine-tuning VLM representations and using relay RL to avoid local minima. Extensive experiments on Meta-World benchmark tasks demonstrate the effectiveness of FuRL. The main contributions include:
1. **Problem Identification**: The authors highlight the issue of reward misalignment when using VLMs as rewards in RL.
2. **FuRL Introduction**: They introduce FuRL, a method that combines reward alignment and relay RL to improve exploration and policy learning.
4. **Methodology**: FuRL freezes the pre-trained VLM and fine-tunes two MLP-based projection heads to improve the VLM rewards, while relay RL helps the agent escape local minima and collect more diverse data (minimal sketches of both components follow this list).
4. **Experiments**: FuRL is evaluated on various tasks, showing superior performance compared to baselines. Ablation studies validate the effectiveness of each component.
5. **Conclusion**: The paper concludes by discussing future directions, including joint training of reward alignment modules across multiple tasks and applying the method to more complex language instructions.
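To make the reward-alignment component concrete, here is a minimal PyTorch-style sketch of the idea described above: the pre-trained VLM stays frozen, two small MLP projection heads are trained on top of its image and text embeddings, and the dense auxiliary reward is the cosine similarity between the projected embeddings. The class names, the `encode_image`/`encode_text` interface, and the layer sizes are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """Small MLP mapping frozen VLM embeddings into a shared alignment space."""

    def __init__(self, embed_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class FuzzyVLMReward(nn.Module):
    """Sketch of a VLM-based auxiliary reward: cosine similarity between the
    projected image embedding and the projected task-description embedding.
    The VLM is frozen; only the two projection heads receive gradients."""

    def __init__(self, vlm: nn.Module, embed_dim: int):
        super().__init__()
        self.vlm = vlm.eval()
        for p in self.vlm.parameters():            # freeze the pre-trained VLM
            p.requires_grad_(False)
        self.img_head = ProjectionHead(embed_dim)  # fine-tuned
        self.txt_head = ProjectionHead(embed_dim)  # fine-tuned

    def forward(self, image: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            img_emb = self.vlm.encode_image(image)       # assumed VLM interface
            txt_emb = self.vlm.encode_text(text_tokens)  # assumed VLM interface
        z_img = F.normalize(self.img_head(img_emb), dim=-1)
        z_txt = F.normalize(self.txt_head(txt_emb), dim=-1)
        # Dense reward in [-1, 1], added to the sparse environment reward.
        return (z_img * z_txt).sum(dim=-1)
```

The relay RL component can be sketched in a similarly schematic way: within an episode, control is handed from the agent trained with the VLM-shaped reward to an exploration agent trained only on the sparse task reward, so rollouts are not trapped in local minima induced by a misaligned VLM reward. The switching rule and the agent/environment interfaces below are simplifying assumptions and may differ from the paper's exact procedure.

```python
def relay_rollout(env, vlm_reward_agent, sparse_agent, relay_step: int, max_steps: int = 500):
    """Illustrative relay-style rollout: one agent acts until `relay_step`,
    then the other takes over for the rest of the episode."""
    obs, _ = env.reset()
    trajectory = []
    for t in range(max_steps):
        agent = vlm_reward_agent if t < relay_step else sparse_agent
        action = agent.act(obs)                               # assumed agent API
        next_obs, reward, terminated, truncated, _ = env.step(action)
        trajectory.append((obs, action, reward, next_obs))
        obs = next_obs
        if terminated or truncated:
            break
    return trajectory
```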
The paper also discusses potential societal impacts, such as the risk of improper language instructions leading to dangerous behaviors, and suggests measures to mitigate these risks.