PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs

2024-2-13 | Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, Quan Vuong, Tingnan Zhang, Tsang-Wei Edward Lee, Kuang-Huei Lee, Peng Xu, Sean Kirmani, Yuke Zhu, Andy Zeng, Karol Hausman, Nicolas Heess, Chelsea Finn, Sergey Levine, Brian Ichter
The paper introduces PIVOT (Prompting with Iterative Visual Optimization), an approach that lets vision-language models (VLMs) handle spatial reasoning tasks without task-specific fine-tuning. PIVOT casts these tasks as iterative visual question answering: in each iteration the image is annotated with visual proposals (e.g., candidate robot actions), the VLM selects the best ones, and the proposals are refined around those selections for the next iteration. The method is evaluated on several spatial inference tasks, including robotic navigation, manipulation, and localization, and demonstrates zero-shot control of robotic systems as well as improved performance in complex environments. The authors also discuss limitations and potential improvements, noting that more capable VLMs would be needed to extend what PIVOT can do.
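The propose, select, and refine loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the VLM call is replaced by a hypothetical stub (`stub_selector`) that simply prefers candidates near a fixed target, proposals are assumed to be 2D points in normalized image coordinates, and the refinement step (refitting a Gaussian to the selected candidates) and all names and parameter values are assumptions made for illustration.

```python
import random
from statistics import mean, pstdev


def pivot_style_refine(select_best, n_iters=3, n_samples=8, n_keep=3, seed=0):
    """Iteratively sample candidate points, let a selector pick the best,
    and refit the sampling distribution around the picks.

    select_best(candidates, k) stands in for the VLM query: in a real system
    the candidates would be drawn on the image and the VLM asked to choose.
    """
    rng = random.Random(seed)
    mu = [0.5, 0.5]      # assumed start: centre of the normalized image
    sigma = [0.3, 0.3]   # assumed initial spread
    for _ in range(n_iters):
        # Propose: sample candidates and clip them to the image bounds.
        candidates = [
            tuple(min(1.0, max(0.0, rng.gauss(m, s))) for m, s in zip(mu, sigma))
            for _ in range(n_samples)
        ]
        # Select: the (stubbed) VLM returns the k most promising candidates.
        chosen = select_best(candidates, n_keep)
        # Refine: refit mean and spread to the chosen candidates.
        mu = [mean(c[d] for c in chosen) for d in range(2)]
        sigma = [max(pstdev([c[d] for c in chosen]), 1e-3) for d in range(2)]
    return tuple(mu)


def stub_selector(target):
    """Hypothetical stand-in for the VLM: prefers candidates near `target`."""
    def select(candidates, k):
        def dist2(c):
            return (c[0] - target[0]) ** 2 + (c[1] - target[1]) ** 2
        return sorted(candidates, key=dist2)[:k]
    return select


result = pivot_style_refine(stub_selector((0.8, 0.2)))
print(result)
```

In an actual setup, `select_best` would render the numbered candidates onto the camera image, send the annotated image and a task prompt to a VLM, and parse the chosen indices from its reply; the loop structure would otherwise stay the same.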