31 Jan 2024 | Fengyu Yang, Chao Feng, Ziyang Chen, Hyoungseob Park, Daniel Wang, Yiming Dou, Ziyao Zeng, Xien Chen, Rit Gangopadhyay, Andrew Owens, Alex Wong
This paper introduces UniTouch, a unified tactile model that aligns touch signals with pre-trained multimodal representations to enable zero-shot tactile sensing tasks. The model is designed for vision-based tactile sensors and applies to a range of tactile hardware, including GelSight, DIGIT, Taxim, and Tacto. UniTouch learns sensor-specific tokens to handle variations between different tactile sensors and supports cross-modal tasks such as material classification, grasping-stability prediction, and image synthesis. The model is trained with contrastive learning to align tactile embeddings with visual embeddings from large-scale vision-language data, and it incorporates a batch sampling strategy to improve performance across different sensors. UniTouch is evaluated on multiple tasks, including zero-shot touch understanding, cross-modal retrieval, image synthesis with touch, and X-to-touch generation. Results show that it outperforms existing methods on material classification and grasping-stability prediction, demonstrating its effectiveness in bridging touch with other modalities. Its ability to generalize across different tactile sensors and modalities highlights its potential for applications in robotics and computer vision.
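To make the alignment idea concrete, here is a minimal PyTorch sketch (not the authors' code) of the core recipe described above: a touch encoder with learnable per-sensor tokens is trained with a contrastive (InfoNCE-style) loss so its embeddings match those of a frozen, pre-trained image encoder. Names such as TouchEncoder, SENSORS, and the tiny CNN backbone are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SENSORS = ["gelsight", "digit", "taxim", "tacto"]  # hypothetical sensor ids
EMB_DIM = 512

class TouchEncoder(nn.Module):
    """Tiny CNN stand-in for the tactile backbone, plus sensor-specific tokens."""
    def __init__(self, emb_dim: int = EMB_DIM):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )
        # One learnable token per sensor, added to the tactile feature so the
        # model can absorb hardware-specific appearance differences.
        self.sensor_tokens = nn.Embedding(len(SENSORS), emb_dim)

    def forward(self, touch_images: torch.Tensor, sensor_ids: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(touch_images)
        feat = feat + self.sensor_tokens(sensor_ids)
        return F.normalize(feat, dim=-1)

def contrastive_loss(touch_emb: torch.Tensor, vision_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: paired touch/vision embeddings are the positives."""
    logits = touch_emb @ vision_emb.t() / temperature
    targets = torch.arange(touch_emb.size(0), device=touch_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    encoder = TouchEncoder()
    touch = torch.randn(8, 3, 224, 224)              # batch of tactile images
    sensor_ids = torch.randint(0, len(SENSORS), (8,))
    # In UniTouch the targets come from a frozen pre-trained multimodal image
    # encoder; random unit vectors stand in for those embeddings here.
    vision_emb = F.normalize(torch.randn(8, EMB_DIM), dim=-1)
    loss = contrastive_loss(encoder(touch, sensor_ids), vision_emb)
    loss.backward()
    print("contrastive loss:", loss.item())
```

Because only the touch side is trained against a frozen, already-aligned visual embedding space, the resulting tactile embeddings inherit that space's connections to language and images, which is what enables the zero-shot and cross-modal tasks the summary lists.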