5 Mar 2024 | Priya Sundaresan, Quan Vuong, Jiayuan Gu, Peng Xu, Ted Xiao, Sean Kirmani, Tianhe Yu, Michael Stark, Ajinkya Jain, Karol Hausman, Dorsa Sadigh, Jeannette Bohg, Stefan Schaal
RT-Sketch is a goal-conditioned imitation learning (IL) policy for manipulation tasks that takes a hand-drawn sketch of the desired scene as input and outputs actions. The authors address the limitations of natural language and goal images as goal representations: language can be ambiguous, while images tend to be over-specified. Hand-drawn sketches aim to provide a more spatially grounded and robust way to specify goals. The policy is trained on a dataset of 80K trajectories paired with synthetic goal sketches produced by an image-to-sketch stylization network.
Evaluations on six manipulation skills involving tabletop object rearrangement show that RT-Sketch performs on par with image- or language-conditioned agents in straightforward settings, and is significantly more robust when language goals are ambiguous or visual distractors are present. RT-Sketch also handles sketches with varying levels of specificity, from minimal line drawings to detailed, colored drawings. The paper further discusses related work in goal-conditioned IL and image-to-sketch conversion, and provides a detailed experimental setup and results.
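The core interface described above (a policy that conditions action prediction on a goal sketch rather than a language instruction or goal image) can be illustrated with a toy sketch. All names here (`SketchConditionedPolicy`, `embed`, `act`) and the random linear "heads" are illustrative assumptions, not the RT-Sketch architecture or codebase:

```python
import numpy as np

class SketchConditionedPolicy:
    """Toy goal-conditioned policy: conditions action prediction on a
    goal sketch instead of a language instruction or goal image.
    Purely illustrative; the real system uses a learned vision backbone
    and a transformer policy."""

    def __init__(self, embed_dim: int = 8, action_dim: int = 7, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Random linear projections stand in for learned networks.
        self.w_obs = rng.standard_normal((embed_dim, embed_dim))
        self.w_goal = rng.standard_normal((embed_dim, embed_dim))
        self.w_act = rng.standard_normal((2 * embed_dim, action_dim))

    def embed(self, image: np.ndarray, w: np.ndarray) -> np.ndarray:
        # Placeholder featurizer: mean-pool pixels, then project linearly.
        pooled = image.reshape(-1, image.shape[-1]).mean(axis=0)
        return pooled @ w

    def act(self, observation: np.ndarray, goal_sketch: np.ndarray) -> np.ndarray:
        # Goal conditioning: concatenate the sketch embedding with the
        # current observation embedding before predicting an action.
        z = np.concatenate([
            self.embed(observation, self.w_obs),
            self.embed(goal_sketch, self.w_goal),
        ])
        return z @ self.w_act  # one continuous action vector

policy = SketchConditionedPolicy()
obs = np.zeros((16, 16, 8))     # current camera image (toy shape)
sketch = np.ones((16, 16, 8))   # hand-drawn goal sketch (toy shape)
action = policy.act(obs, sketch)
print(action.shape)  # (7,)
```

Swapping in a different goal sketch changes the action output, which is the property that lets a single trained policy be re-targeted to new goals at test time.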