2 Jul 2024 | Albert Yu, Adeline Foote, Raymond Mooney, and Roberto Martín-Martín
This paper introduces Lang4Sim2Real, a framework that uses natural language to bridge the sim2real gap in visual imitation learning. The central challenge in learning image-conditioned robotic policies is acquiring a visual representation that is conducive to low-level control. Because the image space is high-dimensional, learning a good visual representation requires large amounts of visual data, which is expensive to collect in the real world. Sim2Real is a promising alternative: simulators provide cheap data closely related to the target task, but transferring policies from sim to real is difficult when the two domains are visually dissimilar.

To address this, the authors use natural language descriptions of images as a unifying signal across domains that captures the underlying task-relevant semantics. Their key insight is that if two image observations from different domains are labeled with similar language, the policy should predict similar action distributions for both. They show that training the image encoder to predict an image's language description, or the distance between descriptions of sim and real images, is a data-efficient pretraining step that yields a domain-invariant image representation. This encoder then serves as the backbone of an imitation learning (IL) policy trained on both simulated and real demonstrations, and the approach outperforms prior sim2real methods and vision-language pretraining baselines such as CLIP and R3M by 25-40%.

The paper also surveys related work on vision pretraining for robotics, sim2real techniques, and domain-invariant representations. In summary, the method pretrains an image encoder on cross-domain, language-annotated image data and then trains a policy network on action-labeled data from both domains, enabling policy transfer between visually dissimilar domains and outperforming existing sim2real and vision representation learning methods. The authors close by discussing limitations and future work, including scaling the method to large-scale datasets and combining it with other pretraining approaches.
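As a rough illustration of the language-based pretraining idea described above, the following PyTorch sketch trains an image encoder so that pairwise distances between image features match pairwise distances between the corresponding caption embeddings. This is a minimal sketch, not the authors' implementation: the `ImageEncoder` class, the `lang_distance_loss` function, the ResNet-18 backbone, and the placeholder tensors are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

# Hypothetical sketch of the language-distance pretraining objective:
# the image encoder is trained so that the distance between features of
# two images (sim or real) matches the distance between the embeddings
# of their language descriptions.

class ImageEncoder(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()           # expose 512-dim features
        self.backbone = backbone
        self.proj = nn.Linear(512, feat_dim)  # projection head for pretraining

    def forward(self, images):                # images: (B, 3, H, W)
        return self.proj(self.backbone(images))

def lang_distance_loss(img_feats, lang_embs):
    """Match pairwise image-feature distances to pairwise language distances.

    img_feats: (B, D) image features from the encoder
    lang_embs: (B, E) precomputed sentence embeddings of the image captions
               (e.g. from a frozen language model; assumed given here)
    """
    img_dists = torch.cdist(F.normalize(img_feats, dim=-1),
                            F.normalize(img_feats, dim=-1))
    lang_dists = torch.cdist(F.normalize(lang_embs, dim=-1),
                             F.normalize(lang_embs, dim=-1))
    return F.mse_loss(img_dists, lang_dists)

# One pretraining step on a mixed batch of sim and real images.
encoder = ImageEncoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)

images = torch.randn(16, 3, 224, 224)   # placeholder batch (sim + real mixed)
captions = torch.randn(16, 384)         # placeholder caption embeddings

optimizer.zero_grad()
loss = lang_distance_loss(encoder(images), captions)
loss.backward()
optimizer.step()
```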
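Under the same assumptions, the downstream imitation-learning stage might look like the behavior-cloning sketch below, which reuses the `ImageEncoder` from the previous snippet as the policy backbone and finetunes it on a batch mixing simulated and real demonstrations. The `BCPolicy` class, the MLP head, and the 7-dimensional action space are hypothetical placeholders rather than details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of the IL stage: the pretrained, language-aligned
# encoder becomes the policy backbone and is finetuned jointly with a
# small MLP action head on demonstrations from both domains.

class BCPolicy(nn.Module):
    def __init__(self, encoder, feat_dim=128, action_dim=7):
        super().__init__()
        self.encoder = encoder                 # pretrained ImageEncoder
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, images):
        return self.head(self.encoder(images))

policy = BCPolicy(ImageEncoder())              # ImageEncoder from the sketch above
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

# One behavior-cloning step on a batch mixing simulated and real demos.
obs = torch.randn(16, 3, 224, 224)             # placeholder observations
expert_actions = torch.randn(16, 7)            # placeholder expert actions

optimizer.zero_grad()
loss = F.mse_loss(policy(obs), expert_actions)
loss.backward()
optimizer.step()
```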