PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs

2024-02-13 | Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, Quan Vuong, Tingnan Zhang, Tsang-Wei Edward Lee, Kuang-Huei Lee, Peng Xu, Sean Kirmani, Yuke Zhu, Andy Zeng, Karol Hausman, Nicolas Heess, Chelsea Finn, Sergey Levine, Brian Ichter
PIVOT (Prompting with Iterative Visual Optimization) is a visual prompting approach that enables Vision Language Models (VLMs) to perform zero-shot robotic control and spatial reasoning without domain-specific training data. The method casts each task as iterative visual question answering: in every iteration, the image is annotated with visual representations of candidate actions or spatial locations, and the VLM selects the most promising ones. These proposals are refined over successive iterations, allowing the VLM to converge on the best available answer.

Concretely, PIVOT samples candidate actions from a distribution, projects them into image space as annotations on the observation, and queries the VLM to choose among them. The distribution is then refit to the selected candidates and the process repeats, narrowing in on a final answer; a code sketch of this loop appears below.

PIVOT is evaluated on real-world robotic navigation and manipulation, object reference, and spatial reasoning tasks, as well as in simulated environments. It achieves non-zero task success in both navigation and manipulation with no robot training data, and performance improves with more iterations and more parallel VLM calls. On RefCOCO spatial reasoning tasks, PIVOT performs strongly even in the first iteration. Although current performance is far from perfect, these results highlight both the potential and the limitations of this regime and point to a promising way of applying Internet-scale VLMs to robotic and spatial reasoning domains.
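For illustration, the iterative refinement resembles a cross-entropy-method-style optimization over image coordinates. The Python sketch below is a minimal, hypothetical rendering of that loop under assumed details: the `query_vlm` stub, the numbered-marker styling, and the Gaussian parameterization are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from PIL import Image, ImageDraw

def query_vlm(image: Image.Image, prompt: str, num_picks: int) -> list[int]:
    """Hypothetical VLM call: returns the labels of the most promising
    annotated candidates. A real system would call a GPT-4V-style API here."""
    raise NotImplementedError

def pivot(image: Image.Image, task_prompt: str,
          num_candidates: int = 10, num_iters: int = 3, num_picks: int = 3):
    # Start with a broad Gaussian over 2D image coordinates.
    mean = np.array(image.size, dtype=float) / 2.0
    std = np.array(image.size, dtype=float) / 4.0

    for _ in range(num_iters):
        # 1. Sample candidate actions/locations from the current distribution.
        candidates = np.random.normal(mean, std, size=(num_candidates, 2))
        candidates = np.clip(candidates, 0, np.array(image.size) - 1)

        # 2. Project candidates into image space as numbered visual markers.
        annotated = image.copy()
        draw = ImageDraw.Draw(annotated)
        for i, (x, y) in enumerate(candidates):
            draw.ellipse([x - 8, y - 8, x + 8, y + 8], outline="red", width=2)
            draw.text((x + 10, y - 10), str(i), fill="red")

        # 3. Ask the VLM to select the most promising candidates by label.
        picks = query_vlm(annotated, task_prompt, num_picks)

        # 4. Refit the sampling distribution to the selected candidates.
        selected = candidates[picks]
        mean = selected.mean(axis=0)
        std = selected.std(axis=0) + 1e-3  # floor keeps a little exploration

    return mean  # final answer: the refined distribution's center in image space
```

Calling something like `pivot(img, "point to where the robot should grasp")` would return the image coordinate the loop converges on. The small floor added to `std` is a design choice in this sketch: without it, the distribution can collapse after one iteration and stop exploring.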