22 Jan 2024 | Vedant Dave*, Fotios Lygerakis*, Elmar Rueckert
This paper introduces MViTac, a method for multimodal visual-tactile representation learning through self-supervised contrastive pre-training. The goal is to fuse visual and tactile data effectively so that a robot can better understand and interact with the physical world and respond adaptively to changing environments. MViTac learns representations from both sensory inputs using intra- and inter-modal losses: two sets of encoders compute an InfoNCE loss for each modality, where the intra-modal loss maximizes agreement within the same modality and the inter-modal loss maximizes similarity across modalities. The encoders are pre-trained on the Touch-and-Go and Calandra datasets and evaluated on two downstream tasks, material classification and grasping success prediction. On both tasks, MViTac outperforms existing self-supervised and supervised baselines in tactile-only and visual-tactile settings, demonstrating that the approach yields robust modality encoders and underscoring the value of integrating visual and tactile data for robotic manipulation. The authors also note limitations, including the need for validation on real robotic platforms and for extending the evaluation to more complex tasks. Overall, MViTac demonstrates the effectiveness of self-supervised contrastive learning for multimodal representation learning in robotic applications.
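To make the intra- and inter-modal objective concrete, here is a minimal PyTorch sketch of how paired visual and tactile embeddings could be combined with InfoNCE losses. The encoder outputs, temperature, augmented views, and the weighting term `lam` are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of intra- and inter-modal InfoNCE losses for visual-tactile
# pre-training. Assumes batched, paired embeddings from separate visual and
# tactile encoders; hyperparameters here are assumptions for illustration.
import torch
import torch.nn.functional as F


def info_nce(queries: torch.Tensor, keys: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE loss where queries[i] and keys[i] form the positive pair;
    all other keys in the batch act as negatives."""
    queries = F.normalize(queries, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = queries @ keys.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, targets)


def visual_tactile_loss(vis_a, vis_b, tac_a, tac_b, lam: float = 1.0) -> torch.Tensor:
    """Combine intra-modal terms (agreement between two augmented views of the
    same modality) with inter-modal terms (agreement between paired visual and
    tactile embeddings). `lam` is a hypothetical weighting factor."""
    intra = info_nce(vis_a, vis_b) + info_nce(tac_a, tac_b)
    inter = info_nce(vis_a, tac_a) + info_nce(tac_a, vis_a)
    return intra + lam * inter


# Usage example with random embeddings (batch of 32, 128-dim features).
if __name__ == "__main__":
    vis_a, vis_b = torch.randn(32, 128), torch.randn(32, 128)
    tac_a, tac_b = torch.randn(32, 128), torch.randn(32, 128)
    print(visual_tactile_loss(vis_a, vis_b, tac_a, tac_b).item())
```

In this sketch, positives for the intra-modal terms come from augmented views of the same sample, while positives for the inter-modal terms come from the visual-tactile pairing itself, matching the high-level description above.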