Hearing Touch: Audio-Visual Pretraining for Contact-Rich Manipulation

14 May 2024 | Jared Mejia, Victoria Dean, Tess Hellebrekers, Abhinav Gupta
This paper introduces a novel approach to robotic manipulation that uses contact microphones as an alternative tactile sensor. Because contact microphones capture audio-based information rather than the image-like signals of conventional tactile sensors, they allow large-scale audio-visual pretraining to be applied to tactile representation learning.

The proposed method uses Audio-Visual Instance Discrimination (AVID) for pretraining and uses the resulting weights to initialize the encoder for contact-audio representations. The policy is then trained with behavior cloning to predict robot actions from visual and audio inputs.

The approach is validated on three real-world manipulation tasks in the low-data regime, where it outperforms visual-only and trained-from-scratch baselines in success rate and reward. The experiments also show that the method is robust to visual differences between training and test settings, and that pretrained audio features help prevent the policy from overfitting to visual details.

The paper additionally discusses related work on audio in robotics, tactile sensing for manipulation, audio-visual representation learning, and representation learning for robotic manipulation. It concludes that pairing contact microphones with large-scale audio-visual pretraining is a promising direction for improving robotic manipulation, especially in low-data settings.
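To make the pretraining stage concrete, the sketch below implements a simplified cross-modal instance-discrimination loss in PyTorch. This is a hedged approximation, not the paper's exact objective: the real AVID method uses a memory bank of negatives and specific encoder architectures not shown here, and the function name, embedding dimension, and batch-as-negatives setup are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def avid_style_contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """Cross-modal instance-discrimination loss (InfoNCE-style).

    Simplified stand-in for the AVID objective: video and audio
    embeddings from the same clip are pulled together, while the
    other clips in the batch serve as negatives. (AVID itself draws
    negatives from a memory bank; a batch suffices to illustrate.)
    """
    video_emb = F.normalize(video_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    # Similarity of every video embedding to every audio embedding.
    logits = video_emb @ audio_emb.t() / temperature
    # Matching (same-clip) pairs lie on the diagonal.
    targets = torch.arange(len(video_emb))
    # Symmetric cross-entropy: video->audio and audio->video retrieval.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Random embeddings stand in for encoder outputs.
v = torch.randn(16, 128)  # video-clip embeddings
a = torch.randn(16, 128)  # audio-clip embeddings
loss = avid_style_contrastive_loss(v, a)
```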
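The downstream behavior-cloning stage can likewise be pictured with a minimal sketch of a policy that fuses visual and contact-audio features and regresses demonstrated actions. The encoder architectures, input shapes, and 7-dimensional action space below are assumptions for illustration; in the paper, the audio encoder would be initialized from AVID-pretrained weights rather than trained from scratch.

```python
import torch
import torch.nn as nn

class AudioVisualPolicy(nn.Module):
    """Behavior-cloning policy fusing visual and contact-audio features.

    A minimal sketch: the paper initializes the audio encoder from
    AVID pretraining; here both encoders are untrained stand-ins with
    assumed input shapes (RGB frames, audio mel-spectrograms).
    """

    def __init__(self, feat_dim=128, action_dim=7):
        super().__init__()
        # Visual encoder: stand-in CNN over 3x128x128 camera frames.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Audio encoder: stand-in CNN over 1xFxT spectrograms computed
        # from the contact-microphone signal.
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Policy head: concatenated features -> continuous action.
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image, spectrogram):
        fused = torch.cat(
            [self.visual_encoder(image), self.audio_encoder(spectrogram)],
            dim=-1,
        )
        return self.head(fused)

# One behavior-cloning step: regress expert actions from observations.
policy = AudioVisualPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
image = torch.randn(8, 3, 128, 128)       # batch of camera frames
spectrogram = torch.randn(8, 1, 64, 100)  # batch of audio spectrograms
expert_action = torch.randn(8, 7)         # demonstrated actions

optimizer.zero_grad()
pred = policy(image, spectrogram)
loss = nn.functional.mse_loss(pred, expert_action)
loss.backward()
optimizer.step()
```

Freezing or fine-tuning the pretrained audio encoder is the kind of design choice that matters most in the low-data regime the paper targets, since a frozen pretrained encoder cannot overfit to the handful of demonstrations.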