7 May 2024 | Yiming Dou, Fengyu Yang, Yi Liu, Antonio Loquercio, Andrew Owens
The paper introduces Tactile-Augmented Radiance Fields (TaRF), a scene representation that brings visual and tactile signals into a shared 3D space, allowing both modalities to be estimated at any 3D position in a scene. A TaRF is captured from a collection of photos and sparsely sampled touch probes: the visual and tactile signals are registered into a common 3D frame, and a conditional diffusion model is trained to impute touch signals at locations that were never probed. Using this procedure, the authors collect a dataset of TaRFs that contains more touch samples than previous real-world visual-tactile datasets and provides a spatially aligned visual signal for each touch measurement. Evaluations demonstrate the accuracy of the cross-modal generative model and the utility of the captured visual-tactile data on downstream tasks such as tactile localization and material classification. The work highlights how 3D scene geometry and multi-view geometry constraints can improve cross-modal prediction models.
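As a rough illustration of the registration step described above (a minimal sketch, not the authors' code), the snippet below composes a camera-to-world pose, such as one recovered by structure-from-motion or the radiance field's pose estimates, with a fixed, pre-calibrated sensor-to-camera transform to place a touch probe in the scene's shared 3D coordinate frame. All function and variable names here are hypothetical.

```python
import numpy as np


def pose_to_matrix(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 rigid transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def register_touch_probe(T_world_camera: np.ndarray,
                         T_camera_sensor: np.ndarray) -> np.ndarray:
    """Place a touch probe in scene (world) coordinates.

    T_world_camera: camera-to-world pose of the frame in which the probe was taken.
    T_camera_sensor: fixed rigid transform from the tactile sensor to the camera,
                     assumed known from a one-time calibration.
    Returns the 3D position of the sensor's contact surface in world coordinates.
    """
    T_world_sensor = T_world_camera @ T_camera_sensor
    return T_world_sensor[:3, 3]


if __name__ == "__main__":
    # Hypothetical example: camera 1.5 m above the origin, axis-aligned; sensor
    # mounted 10 cm in front of the camera along its viewing axis.
    T_world_camera = pose_to_matrix(np.eye(3), np.array([0.0, 1.5, 0.0]))
    T_camera_sensor = pose_to_matrix(np.eye(3), np.array([0.0, 0.0, -0.1]))
    print(register_touch_probe(T_world_camera, T_camera_sensor))  # -> [ 0.   1.5 -0.1]
```

Once all probes are expressed in this shared frame, the radiance field can render a spatially aligned view at each probed location, which is the kind of visual conditioning the touch-imputing diffusion model relies on.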