ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data

27 Jun 2024 | Zeyi Liu, Cheng Chi, Eric Cousineau, Naveen Kuppuswamy, Benjamin Burchfiel, Shuran Song
ManiWAV is a system for learning robot manipulation skills from in-the-wild audio-visual data. It addresses the limitations of traditional vision-based approaches by leveraging audio signals, which carry rich information about contact events, contact modes, surface materials, and object states. The key contributions of ManiWAV are:

1. **Data Collection Device**: An 'ear-in-hand' gripper that collects synchronized audio and visual data during human demonstrations, providing haptic feedback and capturing high-frequency audio signals.
2. **Policy Learning**: An end-to-end sensorimotor learning model that encodes and fuses vision and audio information with a transformer-based architecture, enabling the robot to learn from multimodal human demonstrations (a minimal fusion sketch appears after this list).
3. **Data Augmentation**: A strategy to bridge the domain gap between training and deployment data by augmenting the demonstration audio with background noise and robot motor noise (see the augmentation sketch below).
4. **Evaluation**: The system is evaluated on four contact-rich manipulation tasks (flipping a bagel, wiping a shape, pouring objects, and taping wires) and shown to outperform several baselines, demonstrating robustness and generalization to unseen environments.

The study highlights the value of audio signals in robot manipulation, particularly in scenarios where visual information is ambiguous or incomplete. The system's ability to learn from diverse in-the-wild human demonstrations and to generalize to new environments makes it a promising approach for advancing robot manipulation capabilities.
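To make the transformer-based fusion in contribution 2 concrete, here is a minimal PyTorch sketch. The summary only states that vision and audio embeddings are fused by a transformer; the encoder choices, feature dimensions, layer counts, and the 7-dimensional action head below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    """Fuse per-timestep vision and audio embeddings with a small transformer.

    Illustrative sketch: feature sizes (512 for vision, 128 for audio),
    layer counts, and the action dimension are assumptions.
    """
    def __init__(self, dim: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.vision_proj = nn.Linear(512, dim)   # e.g. CNN image features -> dim
        self.audio_proj = nn.Linear(128, dim)    # e.g. mel-spectrogram features -> dim
        self.modality_emb = nn.Parameter(torch.randn(2, dim))  # learned modality tokens
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(dim, 7)     # e.g. 6-DoF pose delta + gripper

    def forward(self, vision_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        # vision_feat: (B, 512), audio_feat: (B, 128)
        v = self.vision_proj(vision_feat) + self.modality_emb[0]
        a = self.audio_proj(audio_feat) + self.modality_emb[1]
        tokens = torch.stack([v, a], dim=1)       # (B, 2, dim): one token per modality
        fused = self.encoder(tokens).mean(dim=1)  # pool across the two modality tokens
        return self.action_head(fused)            # predicted action
```

The modality embeddings let self-attention distinguish the vision token from the audio token, so the policy can weight whichever modality is more informative, for example relying on audio when contact is visually occluded.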
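For contribution 3, the following sketch shows one common way to mix noise recordings into demonstration audio at a random signal-to-noise ratio. The SNR range and mixing procedure here are assumptions for illustration; the paper's exact augmentation parameters are not reproduced.

```python
import torch

def augment_audio(clip: torch.Tensor, noise: torch.Tensor,
                  snr_db_range: tuple = (5.0, 20.0)) -> torch.Tensor:
    """Mix a background- or motor-noise recording into a demo audio clip
    at a random SNR. Both inputs are 1-D waveforms at the same sample rate.
    The SNR range is an illustrative assumption."""
    # Tile or crop the noise so it covers the full clip.
    if noise.numel() < clip.numel():
        reps = clip.numel() // noise.numel() + 1
        noise = noise.repeat(reps)
    noise = noise[: clip.numel()]

    # Sample a target SNR and scale the noise to achieve it.
    snr_db = torch.empty(1).uniform_(*snr_db_range)
    clip_power = clip.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-8)  # avoid divide-by-zero
    scale = torch.sqrt(clip_power / (noise_power * 10 ** (snr_db / 10)))
    return clip + scale * noise
```

Training on such mixtures exposes the policy to deployment-like conditions, where robot motor hum and ambient noise overlay the contact sounds the gripper microphone is meant to capture.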