RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation

5 Jul 2024 | Yuxuan Kuang, Junjie Ye, Haoran Geng, Jiageng Mao, Congyue Deng, Leonidas Guibas, He Wang, Yue Wang
RAM is a retrieval-based affordance transfer framework for generalizable zero-shot robotic manipulation. Unlike existing methods that rely on in-domain demonstrations, RAM leverages out-of-domain data to learn versatile manipulation capabilities. It first extracts unified affordance from diverse sources, including robotic data, human-object interaction (HOI) data, and custom data, to build a comprehensive affordance memory. Given a language instruction, RAM hierarchically retrieves the most similar demonstration from the memory and transfers its 2D affordance to 3D for robotic execution. This approach enables effective manipulation across various objects, environments, and robotic embodiments.

Extensive simulations and real-world experiments show that RAM outperforms existing methods on diverse tasks, demonstrating its effectiveness in zero-shot scenarios. RAM also supports downstream applications such as automatic data collection, one-shot visual imitation, and integration with large language models (LLMs) and vision-language models (VLMs) for long-horizon tasks. The method is data-efficient and embodiment-agnostic, making it suitable for a wide range of robotic applications. Key contributions include a retrieval-based affordance transfer framework, a scalable module for extracting unified affordance from heterogeneous data, and support for various downstream applications. The framework is evaluated on multiple tasks, showing significant improvements in success rates compared to baselines. RAM's ability to generalize across domains and tasks highlights its potential for future robotic manipulation research.
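To make the retrieve-then-transfer idea concrete, below is a minimal sketch of the kind of pipeline the summary describes: a memory of demonstrations with language and visual embeddings, hierarchical retrieval (coarse language filter, then fine visual ranking), and back-projection of a 2D contact point to 3D using depth and camera intrinsics. This is not the authors' code; all function names, embedding sizes, and the use of cosine similarity are illustrative assumptions, and the step that maps the retrieved 2D affordance onto the target observation is simplified away.

```python
# Hypothetical sketch of a retrieve-then-transfer affordance pipeline (not the
# paper's implementation). Names, shapes, and similarity choices are assumptions.
import numpy as np

def cosine(query, batch):
    """Cosine similarity between one vector and a batch of vectors."""
    query = query / (np.linalg.norm(query) + 1e-8)
    batch = batch / (np.linalg.norm(batch, axis=-1, keepdims=True) + 1e-8)
    return batch @ query

def retrieve(memory, lang_emb, vis_emb, k=5):
    """Hierarchical retrieval: coarse task-level filter by language similarity,
    then fine appearance-level ranking by visual similarity."""
    lang_scores = cosine(lang_emb, np.stack([m["lang_emb"] for m in memory]))
    top = np.argsort(-lang_scores)[:k]
    vis_scores = cosine(vis_emb, np.stack([memory[i]["vis_emb"] for i in top]))
    return memory[int(top[int(np.argmax(vis_scores))])]

def lift_to_3d(u, v, depth, K):
    """Back-project a 2D pixel affordance (u, v) to a 3D point in the camera
    frame using a depth image and pinhole intrinsics K."""
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.array([x, y, z])

# Toy usage with random placeholders standing in for real embeddings/observations.
rng = np.random.default_rng(0)
memory = [{"lang_emb": rng.normal(size=64),
           "vis_emb": rng.normal(size=128),
           "contact_uv": (int(rng.integers(0, 64)), int(rng.integers(0, 64)))}
          for _ in range(100)]
demo = retrieve(memory, lang_emb=rng.normal(size=64), vis_emb=rng.normal(size=128))

# In the full pipeline the retrieved 2D affordance would be transferred onto the
# current observation (e.g., via dense feature matching) before lifting; here we
# reuse the demo's pixel directly for brevity.
u, v = demo["contact_uv"]
depth = np.full((64, 64), 0.8)                               # dummy depth image (meters)
K = np.array([[60.0, 0, 32], [0, 60.0, 32], [0, 0, 1.0]])    # dummy pinhole intrinsics
print("3D contact point:", lift_to_3d(u, v, depth, K))
```

The two-stage retrieval mirrors the hierarchical matching described above: language similarity narrows the memory to demonstrations of the right task, and visual similarity picks the one whose observation best matches the current scene before its affordance is lifted to 3D for execution.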