20 Feb 2024 | Letian Fu, Gaurav Datta, Huang Huang, William Chung-Ho Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, Ken Goldberg
This paper introduces a new Touch-Vision-Language (TVL) dataset to bridge the gap between the tactile and language modalities. The dataset consists of 44K paired vision-tactile observations, with 10% of the data annotated by humans and the rest labeled by GPT-4V. The authors train a vision-and-language-aligned tactile encoder using pairwise contrastive learning, along with a TVL model that generates tactile descriptions from visual and tactile inputs. The TVL model demonstrates significant improvements in touch-vision-language alignment over existing models, achieving a 29% higher classification accuracy and a 12% improvement over GPT-4V on a new touch-vision understanding benchmark. The paper also discusses the challenges and limitations of the current approach, highlighting the need for larger and more diverse touch-vision-language datasets to further improve multimodal alignment.
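As a rough illustration of the pairwise contrastive alignment step described above, the sketch below trains a tactile encoder against frozen, pre-aligned vision and text embeddings (CLIP-style) with a symmetric InfoNCE loss. The `TactileEncoder` architecture, the `info_nce` helper, the 512-dimensional embedding space, and the random tensors standing in for a real batch are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TactileEncoder(nn.Module):
    """Hypothetical tactile encoder mapping tactile images into a shared embedding space."""
    def __init__(self, embed_dim=512):
        super().__init__()
        # Stand-in backbone; assumes 224x224 RGB tactile images.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=8, stride=8),  # -> (64, 28, 28)
            nn.Flatten(),
            nn.Linear(64 * 28 * 28, embed_dim),
        )

    def forward(self, x):
        # L2-normalize so dot products are cosine similarities.
        return F.normalize(self.backbone(x), dim=-1)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of normalized embeddings (matched by index)."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# One training step on a toy batch of paired observations.
encoder = TactileEncoder()
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

tactile_imgs = torch.randn(8, 3, 224, 224)              # paired tactile observations
vision_emb = F.normalize(torch.randn(8, 512), dim=-1)   # frozen vision embeddings
text_emb = F.normalize(torch.randn(8, 512), dim=-1)     # frozen caption embeddings

optimizer.zero_grad()
tactile_emb = encoder(tactile_imgs)
# Align touch to both vision and language; the vision/text encoders stay frozen.
loss = info_nce(tactile_emb, vision_emb) + info_nce(tactile_emb, text_emb)
loss.backward()
optimizer.step()
```

Because the vision and text embeddings are already aligned with each other, pulling the tactile embeddings toward both pairs places touch in the same shared space without retraining the other encoders.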