ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data

27 Jun 2024 | Zeyi Liu, Cheng Chi, Eric Cousineau, Naveen Kuppuswamy, Benjamin Burchfiel, Shuran Song
ManiWAV is a system that learns robot manipulation policies from in-the-wild audio-visual data. Its 'ear-in-hand' data collection device, built on a hand-held gripper, captures synchronized audio and visual feedback from human demonstrations, recording high-frequency audio signals while providing haptic feedback to the demonstrator. A policy interface then learns robot manipulation policies directly from these demonstrations.

Audio is central to the approach because it reveals contact events, contact modes, surface materials, and object states that are difficult to capture with vision alone. Because microphones are inexpensive and widely available, audio is also a scalable modality for both data collection and policy learning.

The policy is a transformer-based model that encodes and fuses vision and audio information, with a diffusion head for action prediction; a minimal sketch of this architecture appears below. A data augmentation strategy bridges the audio domain gap between in-the-wild recordings and actual robot deployment.

The system is evaluated on four contact-rich manipulation tasks: wiping a shape from a whiteboard, flipping a bagel with a spatula, pouring objects from a cup, and taping wires with Velcro tape. It outperforms alternative approaches on all four tasks and, by learning from diverse human demonstrations, generalizes to unseen in-the-wild environments.
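This summary does not include the authors' implementation, but the fusion-plus-diffusion design can be illustrated concretely. Below is a minimal PyTorch sketch assuming a small CNN vision encoder, a 1-D CNN over mel-spectrogram frames for audio, a transformer encoder over the two modality tokens, and an MLP denoiser conditioned on the fused feature and the diffusion timestep; all module names, dimensions, and hyperparameters are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch of the vision-audio fusion policy with a diffusion action head.
# Encoders, dimensions, and names are assumptions, not the authors' exact code.
import torch
import torch.nn as nn


class AudioVisualPolicy(nn.Module):
    """Fuses vision and audio tokens with a transformer; a diffusion-style
    head predicts the noise added to an action sequence."""

    def __init__(self, d_model=256, n_heads=4, n_layers=4,
                 action_dim=10, horizon=16, n_diffusion_steps=100):
        super().__init__()
        # Vision encoder: a small CNN standing in for a ResNet-style backbone.
        self.vision_enc = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d_model),
        )
        # Audio encoder: 1-D CNN over a mel-spectrogram (mel bins as channels).
        self.audio_enc = nn.Sequential(
            nn.Conv1d(64, 128, 5, stride=2), nn.ReLU(),
            nn.Conv1d(128, 128, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(128, d_model),
        )
        # Transformer encoder fuses the two modality tokens.
        self.modality_emb = nn.Parameter(torch.zeros(2, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)
        # Denoiser conditioned on the fused feature and the diffusion timestep.
        self.step_emb = nn.Embedding(n_diffusion_steps, d_model)
        self.denoiser = nn.Sequential(
            nn.Linear(horizon * action_dim + 3 * d_model, 512), nn.Mish(),
            nn.Linear(512, 512), nn.Mish(),
            nn.Linear(512, horizon * action_dim),
        )

    def forward(self, image, spectrogram, noisy_actions, t):
        v = self.vision_enc(image)                       # (B, d_model)
        a = self.audio_enc(spectrogram)                  # (B, d_model)
        tokens = torch.stack([v, a], dim=1) + self.modality_emb
        fused = self.fusion(tokens).flatten(1)           # (B, 2 * d_model)
        x = torch.cat([noisy_actions.flatten(1), fused, self.step_emb(t)], dim=-1)
        return self.denoiser(x)                          # predicted noise


policy = AudioVisualPolicy()
image = torch.randn(2, 3, 96, 96)          # wrist-camera frames
spec = torch.randn(2, 64, 200)             # 64 mel bins x 200 audio frames
noisy_actions = torch.randn(2, 16, 10)     # action horizon x action dim
t = torch.randint(0, 100, (2,))            # diffusion timesteps
eps_hat = policy(image, spec, noisy_actions, t)   # (2, 160)
```

At inference time, a standard DDPM- or DDIM-style sampler would repeatedly call `forward` to denoise a randomly initialized action sequence into an executable trajectory.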
Across the four tasks, the evaluation shows that audio feedback improves both robustness and generalizability, and that fusing vision and audio features with a transformer is more effective than fusing them with an MLP. The approach remains limited by its need for high-quality audio signals and by potential domain gaps between training and deployment audio; one plausible form of the augmentation used to narrow this gap is sketched below. Future work could explore hierarchical network architectures that infer higher-frequency actions from audio inputs.
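To make the domain-gap mitigation concrete, here is a hedged sketch of audio augmentation in the spirit described above: mixing deployment-time noise (e.g., motor hum recorded on the real robot) into demonstration audio at random signal-to-noise ratios, then applying random gain. The function name, parameter ranges, and transforms are assumptions for illustration, not the paper's exact recipe.

```python
# Illustrative audio-domain augmentation; parameters are assumed, not from the paper.
import torch


def augment_audio(wave, noise_bank,
                  snr_db_range=(5.0, 20.0), gain_db_range=(-6.0, 6.0)):
    """Mix in a random noise clip at a random SNR, then apply random gain.

    wave: (T,) mono demonstration audio.
    noise_bank: list of 1-D noise clips recorded in the deployment environment.
    """
    out = wave.clone()
    # Pick a random noise clip and crop or tile it to match the signal length.
    noise = noise_bank[torch.randint(len(noise_bank), (1,)).item()]
    if noise.numel() >= out.numel():
        start = torch.randint(noise.numel() - out.numel() + 1, (1,)).item()
        noise = noise[start:start + out.numel()]
    else:
        reps = out.numel() // noise.numel() + 1
        noise = noise.repeat(reps)[:out.numel()]
    # Scale the noise to hit a randomly drawn signal-to-noise ratio (in dB).
    snr_db = torch.empty(1).uniform_(*snr_db_range).item()
    sig_pow = out.pow(2).mean()
    noise_pow = noise.pow(2).mean().clamp_min(1e-8)
    out = out + noise * torch.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    # Random gain simulates microphone and level differences across setups.
    gain_db = torch.empty(1).uniform_(*gain_db_range).item()
    return out * (10 ** (gain_db / 20))


demo = torch.randn(16000)                 # 1 s of demonstration audio at 16 kHz
noise_bank = [torch.randn(48000) * 0.1]   # e.g., motor hum recorded on the robot
augmented = augment_audio(demo, noise_bank)
```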