8 Jul 2024 | Boyang Wang, Nikhil Sridhar, Chao Feng, Mark Van der Merwe, Adam Fishman, Nima Fazeli, Jeong Joon Park
This&That is a framework for robot planning that combines language and gesture to generate videos for task execution. The system builds on a video generative model trained on large-scale internet data to produce videos that reflect user intentions, and it addresses three key challenges in video-based planning: communicating tasks unambiguously, generating videos controllably, and translating visual plans into robot actions.

Language-gesture conditioning drives the video generation and proves more effective than language-only conditioning, especially in complex and uncertain environments where language alone is ambiguous. A behavioral cloning design then turns the generated video plans into robot actions. Concretely, a video diffusion model conditioned on language and gestures generates the plan video, and a behavioral cloning model executes it.

Evaluated on the Bridge dataset and in IsaacGym simulation across a range of tasks, This&That achieves state-of-the-art video generation and task-execution performance, showing superior alignment with user intentions and successful translation of video plans into robot actions, and supporting video generation as an effective intermediate representation for generalizable planning. The authors also discuss limitations, such as object shape changes during generation and the 3D ambiguity of 2D gesture-based control, and argue that the approach holds promise for multi-task human-robot collaboration.
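The summary above only names the two learned components, so the sketches below illustrate one plausible reading of each; every module, shape, and hyperparameter here is a hypothetical assumption for illustration, not the paper's actual architecture.

First, a minimal sketch of language-gesture conditioning for video generation, assuming the gestures are 2D pixel points on the first frame (one marking the object, one marking the target location) that are rendered as Gaussian heatmaps and concatenated to the per-frame conditioning channels of a video diffusion denoiser; the language prompt is assumed to enter separately as a text embedding (e.g. via cross-attention) and is omitted here.

```python
# Hypothetical helper: render "this"/"that" gesture points as Gaussian heatmaps
# and stack them with the first frame as conditioning for a video denoiser.
import torch


def gesture_heatmaps(points_xy, height, width, sigma=8.0):
    """Render an (N, H, W) stack of Gaussian heatmaps, one per gesture point."""
    ys = torch.arange(height, dtype=torch.float32).view(1, height, 1)
    xs = torch.arange(width, dtype=torch.float32).view(1, 1, width)
    px = points_xy[:, 0].view(-1, 1, 1)
    py = points_xy[:, 1].view(-1, 1, 1)
    dist_sq = (xs - px) ** 2 + (ys - py) ** 2
    return torch.exp(-dist_sq / (2.0 * sigma ** 2))


def build_conditioning(first_frame, points_xy, num_frames):
    """Stack the first frame with gesture heatmaps and repeat over time.

    first_frame: (3, H, W) image tensor in [0, 1]
    points_xy:   (N, 2) gesture points in pixel (x, y) coordinates
    returns:     (T, 3 + N, H, W) conditioning tensor for the video denoiser
    """
    _, h, w = first_frame.shape
    heat = gesture_heatmaps(points_xy, h, w)        # (N, H, W)
    cond = torch.cat([first_frame, heat], dim=0)    # (3 + N, H, W)
    return cond.unsqueeze(0).repeat(num_frames, 1, 1, 1)


# Example: "this" point on the object, "that" point at the goal location.
frame = torch.rand(3, 256, 256)
points = torch.tensor([[80.0, 120.0], [200.0, 60.0]])
video_cond = build_conditioning(frame, points, num_frames=16)
print(video_cond.shape)  # torch.Size([16, 5, 256, 256])
```

Second, a minimal sketch of the behavioral-cloning side, assuming the policy conditions on the current camera image plus a few future frames taken from the generated video plan and regresses a low-dimensional robot action with an L2 imitation loss.

```python
# Hypothetical behavioral-cloning policy that conditions on the current image
# and a few frames from the generated video plan, then regresses an action.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoConditionedPolicy(nn.Module):
    def __init__(self, action_dim=7, num_plan_frames=2):
        super().__init__()
        in_ch = 3 * (1 + num_plan_frames)          # current frame + plan frames
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(128, action_dim)     # e.g. 6-DoF pose delta + gripper

    def forward(self, obs, plan_frames):
        # obs: (B, 3, H, W); plan_frames: (B, num_plan_frames, 3, H, W)
        x = torch.cat([obs, plan_frames.flatten(1, 2)], dim=1)
        return self.head(self.encoder(x))


# Imitation update on one dummy batch of demonstrations.
policy = VideoConditionedPolicy()
obs = torch.rand(4, 3, 128, 128)
plan = torch.rand(4, 2, 3, 128, 128)
expert_actions = torch.zeros(4, 7)                 # placeholder demonstration actions
loss = F.mse_loss(policy(obs, plan), expert_actions)
loss.backward()
```

Encoding the gesture as extra image channels keeps the conditioning spatially aligned with the scene, which is one simple way a gesture signal can steer the generated video, and hence the cloned actions, toward the intended "this goes there" instruction.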