RT-Sketch: Goal-Conditioned Imitation Learning from Hand-Drawn Sketches

5 Mar 2024 | Priya Sundaresan, Quan Vuong, Jiayuan Gu, Peng Xu, Ted Xiao, Sean Kirmani, Tianhe Yu, Michael Stark, Ajinkya Jain, Karol Hausman, Dorsa Sadigh, Jeannette Bohg, Stefan Schaal
RT-Sketch is a goal-conditioned imitation learning policy that takes a hand-drawn sketch of the desired scene as input and outputs actions for manipulation tasks. The policy is trained on a dataset of 80,000 trajectories paired with synthetically generated goal sketches, and it covers a variety of tabletop object rearrangement skills on an articulated countertop. Architecturally, RT-Sketch is a modified version of RT-1, adapted to consume visual goals in place of language instructions.

Evaluated across six manipulation skills, RT-Sketch performs comparably to language- and image-conditioned policies in straightforward settings, and it is more robust when goals are ambiguous or visually distracting: it outperforms both baselines in spatial precision and human-rated alignment when language instructions are underspecified or visual distractors are present. It also handles varying levels of input specificity, from minimal line drawings to detailed, colored, scene-preserving sketches. The work highlights hand-drawn sketches as a modality for goal specification in visual imitation learning, offering a balance between expressiveness and spatial awareness.
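To make the "visual goals instead of language" adaptation concrete, below is a minimal, hypothetical sketch of a goal-conditioned policy in the spirit of RT-Sketch. It assumes the goal sketch is concatenated channel-wise with the current camera observation, encoded by a small convolutional backbone, and decoded into discretized action-bin logits by a Transformer. The layer sizes, token counts, and action dimensions are illustrative placeholders, not the authors' released architecture.

```python
import torch
import torch.nn as nn


class SketchConditionedPolicy(nn.Module):
    """Minimal goal-conditioned policy sketch (assumed design, not the RT-Sketch codebase).

    The goal sketch replaces the language instruction: it is stacked with the
    RGB observation, encoded into visual tokens, and mapped to per-dimension
    discretized action logits.
    """

    def __init__(self, action_dims: int = 11, action_bins: int = 256):
        super().__init__()
        # 6 input channels: 3 (RGB observation) + 3 (goal sketch).
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),  # -> 8x8 = 64 visual tokens
        )
        layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # One classification head over discretized bins for each action dimension.
        self.action_head = nn.Linear(64, action_dims * action_bins)
        self.action_dims, self.action_bins = action_dims, action_bins

    def forward(self, observation: torch.Tensor, goal_sketch: torch.Tensor) -> torch.Tensor:
        # observation, goal_sketch: (B, 3, H, W) images in [0, 1].
        x = torch.cat([observation, goal_sketch], dim=1)       # (B, 6, H, W)
        tokens = self.encoder(x).flatten(2).transpose(1, 2)    # (B, 64 tokens, 64 dims)
        pooled = self.transformer(tokens).mean(dim=1)          # (B, 64)
        logits = self.action_head(pooled)                      # (B, dims * bins)
        return logits.view(-1, self.action_dims, self.action_bins)


# Usage: predict action-bin logits for one observation/sketch pair.
policy = SketchConditionedPolicy()
obs = torch.rand(1, 3, 128, 128)
sketch = torch.rand(1, 3, 128, 128)
action_logits = policy(obs, sketch)  # shape: (1, 11, 256)
```

The key design choice illustrated here is that the goal enters the policy purely through the visual stream, so the same imitation-learning recipe used for language-conditioned training can be reused once trajectories are paired with (synthetic) goal sketches.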