The paper introduces a framework for training large Vision-Language Models (VLMs) with Reinforcement Learning (RL) to improve their decision-making in multi-step, goal-directed tasks. The framework prompts the VLM to generate chain-of-thought (CoT) reasoning, which is then parsed into an executable action used to interact with the environment and collect task rewards; the VLM is fine-tuned on these rewards, improving its performance across a range of tasks. Empirical results show that the proposed method outperforms commercial models such as GPT-4V and Gemini, demonstrating the effectiveness of CoT reasoning for enhancing VLMs' decision-making abilities. The study also highlights the importance of a moderate scaling factor on the log-likelihood of the CoT tokens, which balances the influence of the reasoning segment against the text-based action tokens.
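To make the training signal concrete, below is a minimal Python sketch of one rollout-and-update step under the recipe described above. All names here (`vlm_generate`, `vlm_token_logps`, `env_step`, `parse_action`, the scaling factor `lam`) are hypothetical stand-ins rather than the paper's actual API; the point is how the CoT output is parsed into an action and how the CoT-token log-likelihood is scaled relative to the action tokens before the policy-gradient update.

```python
import re

# Hypothetical stubs for the VLM and environment; the paper's real components
# (a fine-tunable VLM, a tokenizer, gym-style tasks) are assumed, not shown.

def vlm_generate(image, prompt):
    """Stub VLM call: returns CoT reasoning followed by a formatted action."""
    return ('Thought: my hand totals 14 and the dealer shows a 10, '
            'so I should hit. Action: "hit"')

def vlm_token_logps(image, prompt, text):
    """Stub: per-token log-likelihoods of `text` under the current policy."""
    return [-0.5] * len(text.split())

def env_step(action):
    """Stub environment step: returns (next_image, reward, done)."""
    return None, 1.0, True

def parse_action(cot_output):
    """Parse the free-form CoT output into an executable action string.

    The framework prompts for CoT plus a final action in a fixed format;
    here we assume the action appears quoted after 'Action:'.
    """
    m = re.search(r'Action:\s*"([^"]+)"', cot_output)
    return m.group(1) if m else "noop"

def combined_logp(cot_logps, action_logps, lam):
    """Policy log-likelihood with the CoT segment weighted by lam.

    A moderate lam balances the (long) CoT segment against the (short)
    action segment, per the summary above: lam = 0 ignores the CoT tokens
    entirely, while lam = 1 weights every token equally.
    """
    return lam * sum(cot_logps) + sum(action_logps)

def rollout_objective(image, prompt, lam=0.5):
    """One rollout and its REINFORCE-style surrogate: reward * log pi."""
    output = vlm_generate(image, prompt)
    cot, _, action_text = output.partition("Action:")
    action = parse_action(output)
    _, reward, _ = env_step(action)
    logp = combined_logp(
        vlm_token_logps(image, prompt, cot),
        vlm_token_logps(image, prompt, action_text),
        lam,
    )
    # Maximizing reward * logp is the simplest policy-gradient surrogate;
    # a full implementation would typically use a clipped objective such
    # as PPO, but the CoT scaling factor enters in the same way.
    return reward * logp

print(rollout_objective(image=None, prompt="You are playing blackjack..."))
```

The scaling factor `lam` matters because the CoT segment usually contains far more tokens than the action segment, so an unscaled sum lets the reasoning tokens dominate the gradient; the summary's observation that a moderate value works best reflects exactly this trade-off.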