TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding

2 Apr 2024 | Zhihao Zhang, Shengcao Cao, Yu-Xiong Wang
This paper proposes TAMM, a novel two-stage multi-modal learning approach for 3D shape understanding. TAMM addresses the under-utilization of 2D images in existing multi-modal methods by introducing three synergistic adapters: the CLIP Image Adapter (CIA), the Image Alignment Adapter (IAA), and the Text Alignment Adapter (TAA).

The first stage fine-tunes the CLIP model through the CLIP Image Adapter so that its image and text features better suit 3D shapes, mitigating the domain gap between 3D-rendered images and natural images. The second stage decouples 3D features into two sub-spaces, one focusing on visual attributes and the other on semantic understanding, enabling more comprehensive and effective multi-modal pre-training.

TAMM consistently enhances 3D representations across various 3D encoder architectures, pre-training datasets, and downstream tasks. It achieves a 3.3% improvement in zero-shot classification accuracy on Objaverse-LVIS and a 2.9% improvement on ModelNet40, and it raises linear-probing classification accuracy on ModelNet40 from 96.1% to 99.0%. Its effectiveness is demonstrated through extensive experiments on multiple benchmarks, including Objaverse-LVIS, ModelNet40, ScanObjectNN, and ScanNet.

The key contributions are: identifying the under-utilization of the 2D image modality in existing multi-modal methods, proposing a multi-modal learning framework built on two stages and three unified adapter modules, and demonstrating that TAMM consistently improves 3D representations across 3D encoder architectures, pre-training datasets, and downstream tasks.
The method's success is attributed to its ability to effectively leverage both image and language modalities, leading to more robust and generalizable 3D representations.
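To make the two-stage design more concrete, below is a minimal PyTorch-style sketch of how the three adapters could fit together. This is not the authors' implementation: the `ResidualAdapter` bottleneck, the 512-dimensional features, the residual weight `alpha`, the InfoNCE temperature, and the `pretrain_step` / `zero_shot_logits` helpers are all illustrative assumptions, and stage 1 (re-aligning CIA-adapted image features with CLIP text features) is folded into the same module for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualAdapter(nn.Module):
    """Two-layer bottleneck MLP with a residual connection, standing in for
    TAMM's adapters (CIA / IAA / TAA). Layer sizes are assumptions."""
    def __init__(self, dim: int = 512, hidden: int = 256, alpha: float = 0.5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, dim),
        )
        self.alpha = alpha  # residual mixing weight (assumed)

    def forward(self, x):
        return self.alpha * self.mlp(x) + (1 - self.alpha) * x


class TAMMSketch(nn.Module):
    """Illustrative TriAdapter layout:
    - CIA adapts frozen CLIP image features of rendered views (stage 1).
    - IAA / TAA split the 3D feature into an image-aligned and a
      text-aligned sub-space (stage 2)."""
    def __init__(self, encoder_3d: nn.Module, dim: int = 512):
        super().__init__()
        self.encoder_3d = encoder_3d     # any point-cloud encoder producing dim-d features
        self.cia = ResidualAdapter(dim)  # CLIP Image Adapter
        self.iaa = ResidualAdapter(dim)  # Image Alignment Adapter
        self.taa = ResidualAdapter(dim)  # Text Alignment Adapter

    def forward(self, points, clip_image_feat):
        f3d = self.encoder_3d(points)                      # (B, dim) 3D feature
        f_img_branch = F.normalize(self.iaa(f3d), dim=-1)  # visual sub-space
        f_txt_branch = F.normalize(self.taa(f3d), dim=-1)  # semantic sub-space
        f_img = F.normalize(self.cia(clip_image_feat), dim=-1)  # adapted rendered-image feature
        return f_img_branch, f_txt_branch, f_img


def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss between two aligned batches."""
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


def pretrain_step(model, points, clip_image_feat, clip_text_feat):
    """Stage-2 step (sketch): align the image branch with adapted rendered-image
    features and the text branch with frozen CLIP text features."""
    f_img_branch, f_txt_branch, f_img = model(points, clip_image_feat)
    f_txt = F.normalize(clip_text_feat, dim=-1)
    return info_nce(f_img_branch, f_img) + info_nce(f_txt_branch, f_txt)


@torch.no_grad()
def zero_shot_logits(model, points, class_text_feat, temperature=0.07):
    """Zero-shot classification (sketch): score the text-aligned 3D feature
    against CLIP text embeddings of category prompts."""
    f3d = model.encoder_3d(points)
    f_txt_branch = F.normalize(model.taa(f3d), dim=-1)
    return f_txt_branch @ F.normalize(class_text_feat, dim=-1).t() / temperature
```

The sketch reflects the intuition behind TAMM's decoupled sub-spaces: the image-aligned branch can absorb appearance cues from rendered views, while the text-aligned branch stays tied to category semantics, so zero-shot classification only needs the text branch at inference time.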