Vision-Language Models Provide Promptable Representations for Reinforcement Learning

23 May 2024 | William Chen, Oier Mees, Aviral Kumar, Sergey Levine
Vision-Language Models (VLMs) can provide promptable representations for reinforcement learning (RL). This paper introduces PR2L (Promptable Representations for Reinforcement Learning), a framework that prompts a VLM with task-specific context so that its internal knowledge and reasoning yield grounded, task-relevant semantic features of visual observations. These features serve as the input representation for RL policies, which then learn complex, long-horizon tasks in environments such as Minecraft and Habitat. PR2L outperforms policies trained on generic, non-promptable image embeddings and instruction-following methods, and performs comparably to domain-specific embeddings. Using chain-of-thought prompting to elicit common-sense semantic reasoning further improves policy performance in novel scenes by 1.5 times. The framework is evaluated on visually complex tasks in Minecraft and on robot navigation in Habitat, demonstrating the effectiveness of VLMs for embodied control.

The key idea is to treat VLMs as sources of promptable representations: embeddings that encode semantic features of visual observations based on the VLM's internal knowledge and reasoning, elicited by task-specific prompts. The approach is flexible and applies to a range of domains, including robotics and video games. The paper also examines design choices for PR2L, such as which VLM layer to extract representations from and whether to include auxiliary information in the prompt, and it highlights the importance of task-specific prompts for eliciting useful representations. The framework is shown to be effective in both online and offline RL settings, and the results suggest it is more sample- and compute-efficient than some existing methods. The authors conclude that PR2L is a valuable way to extract semantic features from images by prompting VLMs with task context, and that it opens new directions for using VLMs, and other foundation models pre-trained with more sophisticated methods, for control.
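To make the pipeline concrete, here is a minimal sketch of extracting a promptable representation from an off-the-shelf VLM with Hugging Face transformers. The checkpoint, prompt wording, single forward pass, and layer index are illustrative assumptions rather than the paper's exact configuration, and the output field names follow the transformers InstructBLIP API.

```python
# Minimal sketch (not the paper's exact setup): prompt a VLM about an observation
# and take per-token hidden states from one of its language-model layers as the
# observation features for RL. Checkpoint, prompt, and layer index are assumptions.
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

MODEL_ID = "Salesforce/instructblip-vicuna-7b"  # assumed checkpoint
processor = InstructBlipProcessor.from_pretrained(MODEL_ID)
vlm = InstructBlipForConditionalGeneration.from_pretrained(MODEL_ID).eval()  # kept frozen

@torch.no_grad()
def promptable_representation(image: Image.Image, prompt: str, layer: int = -1) -> torch.Tensor:
    """Return (num_tokens, hidden_dim) hidden states of the VLM's language model
    for a task-specific prompt about the given observation image."""
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    outputs = vlm(**inputs, output_hidden_states=True)
    # Field name follows the transformers InstructBLIP output class; which layer
    # to read from is one of the design choices the paper discusses.
    return outputs.language_model_outputs.hidden_states[layer].squeeze(0)

# Hypothetical task-specific prompt for a Minecraft-style task:
# feats = promptable_representation(obs_image, "Question: Is there a spider in this image? Answer:")
```

One could additionally run the VLM's decoder and include hidden states for the generated answer tokens, and a chain-of-thought prompt fits the same interface by asking the model to reason before answering.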
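Because the number of tokens, and therefore feature vectors, varies with the prompt and any generated text, the policy must summarize a variable-length set of embeddings into a fixed-size input. The sketch below, in plain PyTorch, shows one way to do that with a learned query attending over the token features before small actor and critic heads; the architecture and dimensions are illustrative assumptions, not the paper's exact design.

```python
# Illustrative policy head over frozen VLM token features (assumed architecture).
import torch
import torch.nn as nn

class PromptableRepPolicy(nn.Module):
    def __init__(self, feat_dim: int, num_actions: int, hidden_dim: int = 256):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, feat_dim))  # learned summary token
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.actor = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, num_actions)
        )
        self.critic = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, token_feats: torch.Tensor):
        """token_feats: (batch, num_tokens, feat_dim) frozen VLM hidden states."""
        q = self.query.expand(token_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, token_feats, token_feats)  # (batch, 1, feat_dim)
        pooled = pooled.squeeze(1)
        return self.actor(pooled), self.critic(pooled)      # action logits, state value

# Usage with features like those from the previous sketch (dimensions are assumptions):
policy = PromptableRepPolicy(feat_dim=4096, num_actions=12)
logits, value = policy(torch.randn(1, 40, 4096))  # e.g. 40 VLM tokens per observation
```

Keeping the VLM frozen and training only a lightweight head like this is one way such a pipeline can stay sample- and compute-efficient, since the RL updates touch only the small policy network.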