Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training

22 Jan 2024 | Vedant Dave*, Fotios Lygerakis*, Elmar Rueckert
The paper introduces MViTac, a novel methodology for integrating visual and tactile sensory data through self-supervised contrastive learning. MViTac leverages both intra-modality and inter-modality losses to enhance material property classification and grasping prediction. The method uses dual encoders for visual and tactile data, employing the InfoNCE loss for both intra-modal and inter-modal learning. Experiments on the Touch-and-Go and Calandra datasets demonstrate the effectiveness of MViTac, showing superior performance over existing self-supervised and supervised techniques. The results highlight the importance of combining visual and tactile information for more robust and nuanced robotic tasks. The paper also discusses related work in tactile sensing, vision, and visual-tactile joint representation learning, providing a comprehensive overview of the field.
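To make the dual-encoder, intra-/inter-modal contrastive setup concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the class name MViTacSketch, the LazyLinear placeholder encoders, the embedding dimension, the temperature value, and the equal weighting of the loss terms are all illustrative assumptions; only the overall structure (two encoders plus InfoNCE applied within and across modalities) follows the paper's description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MViTacSketch(nn.Module):
    """Illustrative dual-encoder contrastive model (assumed structure, not the authors' code).

    Two encoders map visual and tactile inputs into a shared embedding space.
    InfoNCE losses are computed within each modality (intra-modal) and across
    modalities (inter-modal) on paired visual-tactile samples.
    """

    def __init__(self, embed_dim: int = 128, temperature: float = 0.07):
        super().__init__()
        # Placeholder encoders; real backbones (e.g., CNNs) would replace these.
        self.visual_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))
        self.tactile_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))
        self.temperature = temperature

    @staticmethod
    def info_nce(queries: torch.Tensor, keys: torch.Tensor, temperature: float) -> torch.Tensor:
        """InfoNCE: matching (i, i) pairs in a batch are positives, all others negatives."""
        queries = F.normalize(queries, dim=-1)
        keys = F.normalize(keys, dim=-1)
        logits = queries @ keys.T / temperature           # (B, B) similarity matrix
        targets = torch.arange(queries.size(0), device=queries.device)
        return F.cross_entropy(logits, targets)

    def forward(self, vis_a, vis_b, tac_a, tac_b):
        """vis_a/vis_b and tac_a/tac_b are two augmented views of the same visual-tactile pair."""
        zv_a, zv_b = self.visual_encoder(vis_a), self.visual_encoder(vis_b)
        zt_a, zt_b = self.tactile_encoder(tac_a), self.tactile_encoder(tac_b)

        # Intra-modal terms: two views of the same modality attract.
        loss_intra = (self.info_nce(zv_a, zv_b, self.temperature)
                      + self.info_nce(zt_a, zt_b, self.temperature))

        # Inter-modal terms: paired visual and tactile embeddings attract.
        loss_inter = (self.info_nce(zv_a, zt_a, self.temperature)
                      + self.info_nce(zt_a, zv_a, self.temperature))

        return loss_intra + loss_inter
```

As a usage sketch, a pre-training step would sample a batch of paired visual and tactile images, apply two augmentations per sample, and backpropagate the returned loss; the learned encoders are then evaluated on downstream tasks such as material classification and grasp-success prediction.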