Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

17 May 2024 | Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine
This paper proposes a framework for training large vision-language models (VLMs) with reinforcement learning (RL) to improve their decision-making in multi-step, goal-directed tasks. A task description prompt guides the VLM to produce chain-of-thought (CoT) reasoning that ends in a text-based action; the text action is parsed into an executable action for the environment, and the resulting task rewards are used to fine-tune the entire VLM with RL. A scaling factor balances the influence of the CoT tokens on the action probability used in the RL objective.

The method is evaluated on two domains: an original suite of tasks requiring fine-grained visual recognition and language reasoning (including arithmetic reasoning) and an embodied AI domain focused on visual semantic reasoning. Empirically, the framework substantially improves VLM decision-making, enabling 7B models to outperform commercial models such as GPT-4V and Gemini, and it surpasses baselines including supervised fine-tuning and CNN-based RL in both performance and efficiency. Ablations highlight the importance of CoT reasoning for RL training: removing it leads to a significant drop in performance. The authors also discuss limitations, including the need to explore different prompting techniques and the potential for extending the method to improve multiple tasks simultaneously.
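The role of the scaling factor can be made concrete with a small sketch. The notation below is an assumption based on this summary rather than a quotation of the paper: the symbols o_t (observation), v^cot (CoT tokens), a_t (action tokens), and λ are illustrative, and the exact decomposition may differ in the original work. The idea is that because the CoT is usually much longer than the final action text, summing all token log-probabilities would let the CoT dominate the policy gradient; a factor λ in [0, 1] down-weights the CoT term, with λ = 0 ignoring the CoT entirely and λ = 1 weighting CoT and action tokens equally.

```latex
% Hedged sketch of a CoT-scaled action log-probability (assumed notation).
\[
\log \pi_\theta(a_t \mid o_t)
  \;=\;
  \underbrace{\sum_{j} \log p_\theta\!\bigl(a_t^{(j)} \mid o_t,\, v^{\mathrm{cot}},\, a_t^{(<j)}\bigr)}_{\text{action tokens}}
  \;+\;
  \lambda\,
  \underbrace{\sum_{i} \log p_\theta\!\bigl(v^{\mathrm{cot},(i)} \mid o_t,\, v^{\mathrm{cot},(<i)}\bigr)}_{\text{CoT tokens}},
  \qquad \lambda \in [0, 1].
\]
```

Under this reading, RL fine-tuning backpropagates the task reward through both the action tokens and the CoT tokens, with λ controlling how strongly the reasoning trace itself is shaped by the reward signal.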