2 Apr 2024 | Zhihao Zhang, Shengcao Cao, Yu-Xiong Wang
TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding
This paper proposes TAMM, a novel two-stage multi-modal learning approach for 3D shape understanding. TAMM addresses the under-utilization of 2D images in existing multi-modal methods by introducing three synergistic adapters: CLIP Image Adapter (CIA), Image Alignment Adapter (IAA), and Text Alignment Adapter (TAA). The first stage of TAMM fine-tunes the CLIP model to better align 3D shapes with image and text features, mitigating the domain gap between 3D-rendered images and natural images. The second stage decouples 3D features into two sub-spaces: one focusing on visual attributes and the other on semantic understanding, enabling more comprehensive and effective multi-modal pre-training.
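To make the two-stage design more concrete, below is a minimal PyTorch sketch of the three adapters and the stage-two dual-alignment objective. This is an illustrative reconstruction, not the paper's implementation: the residual-MLP internals, the 512-dimensional feature size, the InfoNCE formulation, and the names `ResidualAdapter` and `pretrain_step` are assumptions chosen for clarity.

```python
# Hypothetical sketch of TAMM's three adapters (CIA, IAA, TAA) and the
# stage-2 alignment losses. Assumes pre-extracted, frozen CLIP image/text
# features and 3D-encoder features of matching dimension.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualAdapter(nn.Module):
    """Lightweight residual MLP adapter (assumed internal design)."""
    def __init__(self, dim: int = 512, hidden: int = 256, res_ratio: float = 0.6):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, dim)
        )
        self.res_ratio = res_ratio  # blend adapted and original features

    def forward(self, x):
        return self.res_ratio * x + (1 - self.res_ratio) * self.mlp(x)


def info_nce(a, b, temperature: float = 0.07):
    """Symmetric CLIP-style contrastive loss between paired feature batches."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


# Stage 1: the CLIP Image Adapter (CIA) re-aligns features of 3D-rendered
# images with CLIP text features, mitigating the rendered-vs-natural domain gap.
cia = ResidualAdapter()

# Stage 2: the 3D encoder output is decoupled into two sub-spaces via the
# Image Alignment Adapter (IAA) and the Text Alignment Adapter (TAA).
iaa = ResidualAdapter()  # visual-attribute sub-space, aligned to image features
taa = ResidualAdapter()  # semantic sub-space, aligned to text features


def pretrain_step(feat_3d, clip_img_feat, clip_txt_feat):
    """One stage-2 step: align each decoupled 3D sub-space with its modality.
    (CIA is kept frozen after stage 1 in the paper; omitted here for brevity.)"""
    loss_img = info_nce(iaa(feat_3d), cia(clip_img_feat))
    loss_txt = info_nce(taa(feat_3d), clip_txt_feat)
    return loss_img + loss_txt
```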
TAMM consistently enhances 3D representations across various 3D encoder architectures, pre-training datasets, and downstream tasks. It achieves a 3.3% improvement in zero-shot classification accuracy on Objaverse-LVIS and a 2.9% improvement on ModelNet40, and it raises few-shot (5-way, 10-shot) linear probing accuracy on ModelNet40 from 96.1% to 99.0%. TAMM's effectiveness is demonstrated through extensive experiments on multiple benchmarks, including Objaverse-LVIS, ModelNet40, ScanObjectNN, and ScanNet.
The method's key contributions include identifying the under-utilization of the 2D image modality in existing multi-modal methods, proposing a novel multi-modal learning framework with two stages and three unified adapter modules, and demonstrating that TAMM consistently enhances 3D representations for various 3D encoder architectures, pre-training datasets, and downstream tasks. The method's success is attributed to its ability to effectively leverage both image and language modalities, leading to more robust and generalizable 3D representations.