Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

17 May 2024 | Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine
This paper proposes a framework for training large vision-language models (VLMs) with reinforcement learning (RL) to improve their decision-making in multi-step, goal-directed tasks. A task description prompt guides the VLM to produce chain-of-thought (CoT) reasoning that ends in a text-based action; the text action is parsed into an executable action for the environment, and the resulting task rewards are used to fine-tune the entire VLM with RL. A scaling factor balances the influence of the CoT tokens on the action probability used in the RL objective.

The method is evaluated on two domains: an original suite of tasks requiring fine-grained visual recognition and language reasoning (including arithmetic reasoning) and an embodied AI domain focused on visual semantic reasoning. Empirically, the framework substantially improves VLM decision-making, enabling 7B models to outperform commercial models such as GPT-4V and Gemini, and it surpasses baselines including supervised fine-tuning and CNN-based RL in both performance and efficiency. Ablations highlight the importance of CoT reasoning for RL training: removing it leads to a significant drop in performance. The authors also discuss limitations, including the need to explore different prompting techniques and the potential for extending the method to improve multiple tasks simultaneously.
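The role of the scaling factor can be made concrete with a small sketch. The notation below is an assumption based on this summary rather than a quotation of the paper: the symbols o_t (observation), v^cot (CoT tokens), a_t (action tokens), and λ are illustrative, and the exact decomposition may differ in the original work. The idea is that because the CoT is usually much longer than the final action text, summing all token log-probabilities would let the CoT dominate the policy gradient; a factor λ in [0, 1] down-weights the CoT term, with λ = 0 ignoring the CoT entirely and λ = 1 weighting CoT and action tokens equally.

```latex
% Hedged sketch of a CoT-scaled action log-probability (assumed notation).
\[
\log \pi_\theta(a_t \mid o_t)
  \;=\;
  \underbrace{\sum_{j} \log p_\theta\!\bigl(a_t^{(j)} \mid o_t,\, v^{\mathrm{cot}},\, a_t^{(<j)}\bigr)}_{\text{action tokens}}
  \;+\;
  \lambda\,
  \underbrace{\sum_{i} \log p_\theta\!\bigl(v^{\mathrm{cot},(i)} \mid o_t,\, v^{\mathrm{cot},(<i)}\bigr)}_{\text{CoT tokens}},
  \qquad \lambda \in [0, 1].
\]
```

Under this reading, RL fine-tuning backpropagates the task reward through both the action tokens and the CoT tokens, with λ controlling how strongly the reasoning trace itself is shaped by the reward signal.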