2 Apr 2024 | Zhihao Zhang, Shengcao Cao, Yu-Xiong Wang
**TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding**
The paper addresses the limited scale of 3D shape datasets by proposing a novel multi-modal learning approach called TriAdapter Multi-Modal Learning (TAMM). TAMM aims to enhance 3D shape understanding by better leveraging the 2D image and text modalities. The key contributions of TAMM include:
1. **Image-Text Re-Alignment**: TAMM introduces a CLIP Image Adapter (CIA), a lightweight adapter fine-tuned on top of the frozen CLIP image encoder, to re-align the features of rendered 2D images with CLIP's text features, addressing the domain gap between rendered views of 3D shapes and the natural images CLIP was pre-trained on (a minimal sketch of such an adapter follows this list).
2. **Decoupled Tri-Modal Pre-Training**: TAMM employs an Image Alignment Adapter (IAA) and a Text Alignment Adapter (TAA) to decouple the 3D feature space into two sub-spaces: one focusing on visual attributes and the other on semantic understanding. Aligning each sub-space with its matching CLIP modality yields a more comprehensive and effective multi-modal pre-training signal (a sketch of this objective appears at the end of this summary).
3. **Performance Enhancements**: Extensive experiments demonstrate that TAMM consistently improves 3D representations across various 3D encoder architectures, pre-training datasets, and downstream tasks. Notably, TAMM boosts zero-shot classification accuracy on Objaverse-LVIS from 46.8% to 50.7% and improves 5-way 10-shot linear probing classification accuracy on ModelNet40 from 96.1% to 99.0%.
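To make the image-text re-alignment in point 1 concrete, here is a minimal sketch, not the authors' code, of a residual-MLP CLIP Image Adapter trained with a CLIP-style contrastive loss. The class name `CLIPImageAdapter`, the hidden size, and the `residual_ratio` blending factor are illustrative assumptions.

```python
# Hypothetical sketch of a CLIP Image Adapter (CIA): a small residual MLP on
# top of frozen CLIP image features, fine-tuned so that features of rendered
# 3D-shape views re-align with CLIP text features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CLIPImageAdapter(nn.Module):
    def __init__(self, dim: int = 768, hidden: int = 256, residual_ratio: float = 0.6):
        super().__init__()
        self.residual_ratio = residual_ratio
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, dim),
        )

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        # Blend adapted features with the original (frozen) CLIP features so the
        # adapter only nudges them toward the text feature space.
        adapted = self.mlp(image_feat)
        return self.residual_ratio * adapted + (1.0 - self.residual_ratio) * image_feat


def clip_contrastive_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # Symmetric InfoNCE loss between two batches of paired features.
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Toy usage with placeholder features; in practice only the adapter is trained,
# while CLIP's image and text encoders stay frozen.
adapter = CLIPImageAdapter()
rendered_image_feat = torch.randn(8, 768)  # stand-in for frozen CLIP image features
caption_text_feat = torch.randn(8, 768)    # stand-in for frozen CLIP text features
loss = clip_contrastive_loss(adapter(rendered_image_feat), caption_text_feat)
```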
The paper also includes a detailed methodology, experimental results, and qualitative evaluations to support the effectiveness of TAMM.
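To make the decoupled tri-modal pre-training in point 2 concrete, here is a minimal sketch under the same assumptions: the 3D encoder's output is projected by two separate heads, one aligned with the re-aligned image features (IAA) and one with the text features (TAA), each via a symmetric contrastive loss. The `AlignmentAdapter` class, MLP structure, and feature dimensions are hypothetical.

```python
# Hypothetical sketch of the decoupled tri-modal objective with an Image
# Alignment Adapter (IAA) and a Text Alignment Adapter (TAA).
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignmentAdapter(nn.Module):
    """Small projection head mapping 3D features into one CLIP sub-space;
    one instance serves as the IAA, another as the TAA."""

    def __init__(self, in_dim: int = 512, out_dim: int = 768, hidden: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # Symmetric contrastive loss between paired feature batches.
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Decoupled objective: one 3D sub-space is pulled toward the (re-aligned)
# image features via the IAA, the other toward text features via the TAA.
iaa, taa = AlignmentAdapter(), AlignmentAdapter()
point_feat = torch.randn(8, 512)  # placeholder 3D encoder output
image_feat = torch.randn(8, 768)  # placeholder re-aligned CLIP image features
text_feat = torch.randn(8, 768)   # placeholder CLIP text features
loss = info_nce(iaa(point_feat), image_feat) + info_nce(taa(point_feat), text_feat)
```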