14 May 2024 | Jared Mejia, Victoria Dean, Tess Hellebrekers, Abhinav Gupta
The paper "Hearing Touch: Audio-Visual Pretraining for Contact-Rich Manipulation" addresses the challenge of leveraging large-scale data for pre-training tactile sensors in robotics, particularly contact microphones. Traditional methods focus on visual pre-training, but this paper introduces a novel approach that uses audio-visual pre-training to enhance tactile sensing. The authors argue that contact microphones capture audio-based information, which can be leveraged through large-scale audio-visual datasets like Audioset. By pre-training an encoder using Audio-Visual Instance Discrimination (AVID), they initialize the encoder with rich, multimodal data, which is then used to train a policy that fuses visual and audio inputs for robotic manipulation tasks. The method is validated through experiments on three real-world manipulation tasks, demonstrating improved performance over visual-only policies and outperforming equivalent policies trained from scratch. The paper highlights the potential of large-scale multisensory pre-training for robotic manipulation, especially in low-data regimes.The paper "Hearing Touch: Audio-Visual Pretraining for Contact-Rich Manipulation" addresses the challenge of leveraging large-scale data for pre-training tactile sensors in robotics, particularly contact microphones. Traditional methods focus on visual pre-training, but this paper introduces a novel approach that uses audio-visual pre-training to enhance tactile sensing. The authors argue that contact microphones capture audio-based information, which can be leveraged through large-scale audio-visual datasets like Audioset. By pre-training an encoder using Audio-Visual Instance Discrimination (AVID), they initialize the encoder with rich, multimodal data, which is then used to train a policy that fuses visual and audio inputs for robotic manipulation tasks. The method is validated through experiments on three real-world manipulation tasks, demonstrating improved performance over visual-only policies and outperforming equivalent policies trained from scratch. The paper highlights the potential of large-scale multisensory pre-training for robotic manipulation, especially in low-data regimes.