8 Jul 2024 | Boyang Wang, Nikhil Sridhar, Chao Feng, Mark Van der Merwe, Adam Fishman, Nima Fazeli, Jeong Joon Park
This&That is a framework for robot planning that combines language and gesture to generate videos for task execution. The system builds on a video generative model trained on large-scale internet data to produce videos that reflect user intentions, and it addresses three key challenges in video-based planning: communicating tasks unambiguously, generating videos controllably, and translating visual plans into robot actions.

Language-gesture conditioning drives the video generation and proves more effective than language-only conditioning, especially in complex and uncertain environments where language alone is ambiguous. A behavioral cloning design then turns the generated video plans into robot actions. Concretely, a video diffusion model conditioned on language and gestures generates the plan video, and a behavioral cloning model executes it.

Evaluated on the Bridge dataset and in IsaacGym simulation across a range of tasks, This&That achieves state-of-the-art video generation and task-execution performance, showing superior alignment with user intentions and successful translation of video plans into robot actions, and supporting video generation as an effective intermediate representation for generalizable planning. The authors also discuss limitations, such as object shape changes during generation and the 3D ambiguity of 2D gesture-based control, and argue that the approach holds promise for multi-task human-robot collaboration.
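The summary above only names the two learned components, so the sketches below illustrate one plausible reading of each; every module, shape, and hyperparameter here is a hypothetical assumption for illustration, not the paper's actual architecture.

First, a minimal sketch of language-gesture conditioning for video generation, assuming the gestures are 2D pixel points on the first frame (one marking the object, one marking the target location) that are rendered as Gaussian heatmaps and concatenated to the per-frame conditioning channels of a video diffusion denoiser; the language prompt is assumed to enter separately as a text embedding (e.g. via cross-attention) and is omitted here.

```python
# Hypothetical helper: render "this"/"that" gesture points as Gaussian heatmaps
# and stack them with the first frame as conditioning for a video denoiser.
import torch


def gesture_heatmaps(points_xy, height, width, sigma=8.0):
    """Render an (N, H, W) stack of Gaussian heatmaps, one per gesture point."""
    ys = torch.arange(height, dtype=torch.float32).view(1, height, 1)
    xs = torch.arange(width, dtype=torch.float32).view(1, 1, width)
    px = points_xy[:, 0].view(-1, 1, 1)
    py = points_xy[:, 1].view(-1, 1, 1)
    dist_sq = (xs - px) ** 2 + (ys - py) ** 2
    return torch.exp(-dist_sq / (2.0 * sigma ** 2))


def build_conditioning(first_frame, points_xy, num_frames):
    """Stack the first frame with gesture heatmaps and repeat over time.

    first_frame: (3, H, W) image tensor in [0, 1]
    points_xy:   (N, 2) gesture points in pixel (x, y) coordinates
    returns:     (T, 3 + N, H, W) conditioning tensor for the video denoiser
    """
    _, h, w = first_frame.shape
    heat = gesture_heatmaps(points_xy, h, w)        # (N, H, W)
    cond = torch.cat([first_frame, heat], dim=0)    # (3 + N, H, W)
    return cond.unsqueeze(0).repeat(num_frames, 1, 1, 1)


# Example: "this" point on the object, "that" point at the goal location.
frame = torch.rand(3, 256, 256)
points = torch.tensor([[80.0, 120.0], [200.0, 60.0]])
video_cond = build_conditioning(frame, points, num_frames=16)
print(video_cond.shape)  # torch.Size([16, 5, 256, 256])
```

Second, a minimal sketch of the behavioral-cloning side, assuming the policy conditions on the current camera image plus a few future frames taken from the generated video plan and regresses a low-dimensional robot action with an L2 imitation loss.

```python
# Hypothetical behavioral-cloning policy that conditions on the current image
# and a few frames from the generated video plan, then regresses an action.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoConditionedPolicy(nn.Module):
    def __init__(self, action_dim=7, num_plan_frames=2):
        super().__init__()
        in_ch = 3 * (1 + num_plan_frames)          # current frame + plan frames
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(128, action_dim)     # e.g. 6-DoF pose delta + gripper

    def forward(self, obs, plan_frames):
        # obs: (B, 3, H, W); plan_frames: (B, num_plan_frames, 3, H, W)
        x = torch.cat([obs, plan_frames.flatten(1, 2)], dim=1)
        return self.head(self.encoder(x))


# Imitation update on one dummy batch of demonstrations.
policy = VideoConditionedPolicy()
obs = torch.rand(4, 3, 128, 128)
plan = torch.rand(4, 2, 3, 128, 128)
expert_actions = torch.zeros(4, 7)                 # placeholder demonstration actions
loss = F.mse_loss(policy(obs, plan), expert_actions)
loss.backward()
```

Encoding the gesture as extra image channels keeps the conditioning spatially aligned with the scene, which is one simple way a gesture signal can steer the generated video, and hence the cloned actions, toward the intended "this goes there" instruction.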