20 Feb 2024 | Letian Fu, Gaurav Datta, Huang Huang, William Chung-Ho Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, Ken Goldberg
This paper introduces the Touch-Vision-Language (TVL) dataset, containing 44,000 paired vision-tactile observations with language descriptions: 10% annotated by humans and 90% pseudo-labeled by GPT-4V. The data were collected in the wild using a 3D-printed device that captures tactile and visual observations synchronously. With this dataset, the authors train a tactile encoder aligned with both the vision and language modalities, plus a Touch-Vision-Language (TVL) model for text generation. The aligned tactile encoder improves touch-vision-language alignment over existing models, with a 29% increase in classification accuracy, and the TVL model outperforms GPT-4V and open-source vision-language models on a new benchmark by 12% and 32%, respectively. By leveraging pseudo-labels from large multimodal models, the work reduces the need for extensive human labeling of touch data and offers a new framework for multimodal alignment and touch-based understanding. The dataset and code are available at https://tactile-vlm.github.io.
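
The summary describes aligning a tactile encoder to existing vision and language embedding spaces. Below is a minimal sketch, not the authors' code, of how such an alignment step could look with a symmetric contrastive (InfoNCE) loss against frozen vision and text encoders; the encoder objects and argument names are hypothetical placeholders.

```python
# Hypothetical sketch: align a trainable tactile encoder to a shared
# vision-language embedding space via symmetric contrastive losses.
import torch
import torch.nn.functional as F


def contrastive_loss(tactile_emb, target_emb, temperature=0.07):
    """Symmetric InfoNCE loss between a batch of tactile embeddings and a
    batch of target-modality (vision or language) embeddings."""
    tactile_emb = F.normalize(tactile_emb, dim=-1)
    target_emb = F.normalize(target_emb, dim=-1)
    logits = tactile_emb @ target_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))


def alignment_step(tactile_encoder, frozen_vision_encoder, frozen_text_encoder,
                   touch, image, caption_tokens):
    """One training step: only the tactile encoder is updated; the vision
    and text encoders stay frozen and define the shared embedding space."""
    z_touch = tactile_encoder(touch)
    with torch.no_grad():
        z_image = frozen_vision_encoder(image)
        z_text = frozen_text_encoder(caption_tokens)
    # Pull touch embeddings toward both the paired image and its caption
    # (human label or GPT-4V pseudo-label), as described in the summary.
    return contrastive_loss(z_touch, z_image) + contrastive_loss(z_touch, z_text)
```

The same loss applies whether the caption comes from a human annotator or from GPT-4V pseudo-labeling, which is what lets the 90% pseudo-labeled portion of the dataset contribute to training.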