GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer

2024 | Ding Jia, Jianyuan Guo, Kai Han, Han Wu, Chao Zhang, Chang Xu, Xinghao Chen
GeminiFusion is a pixel-wise multimodal fusion approach for Vision Transformers that efficiently integrates information across different modalities. The paper critiques prior token exchange methods, which replace less informative tokens with inter-modal features, and shows that cross-attention mechanisms outperform them but are computationally expensive. To address this, GeminiFusion dynamically integrates complementary information across modalities by combining intra-modal and inter-modal attention, using layer-adaptive noise to control the interplay of features on a per-layer basis. The fusion has linear complexity with respect to the number of input tokens, keeping efficiency comparable to unimodal networks.

Comprehensive evaluations across multimodal image-to-image translation, 3D object detection, and arbitrary-modal semantic segmentation show that GeminiFusion outperforms leading techniques while reducing computational overhead. The method is designed to be plug-and-play, allowing seamless integration into various vision backbones; the paper also demonstrates its effectiveness in other frameworks, including Swin Transformer, and its ability to handle highly aligned modalities. Experiments show state-of-the-art performance on several multimodal semantic segmentation benchmarks, strong results in image-to-image translation and 3D object detection, and better accuracy and inference latency than TokenFusion. The paper concludes that GeminiFusion is a promising approach for multimodal fusion, offering a balance between efficiency and effectiveness.
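To make the fusion idea concrete, here is a minimal PyTorch sketch of a pixel-wise fusion block that combines intra-modal and inter-modal attention with a learnable per-layer noise term. This is an illustrative assumption about how the description maps to code, not the authors' released implementation; names such as `PixelwiseGeminiFusion` and `noise` are hypothetical. The key property shown is that each token attends only to itself and to the token at the same spatial position in the other modality, which keeps the cost linear in the number of tokens.

```python
import torch
import torch.nn as nn


class PixelwiseGeminiFusion(nn.Module):
    """Hypothetical sketch of pixel-wise two-modality fusion.

    Each token of modality A attends only to its own token and the token at the
    same position in modality B (and vice versa), so the cost is linear in the
    number of tokens, unlike full cross-attention over all token pairs.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)
        # Layer-adaptive noise: a learnable per-layer vector added to the
        # intra-modal key, modulating how strongly each token relies on itself
        # versus the other modality (an assumption about how the idea maps to code).
        self.noise = nn.Parameter(torch.zeros(dim))
        self.scale = dim ** -0.5

    def fuse(self, x_self: torch.Tensor, x_other: torch.Tensor) -> torch.Tensor:
        # x_self, x_other: (B, N, C) token sequences from the two modalities.
        q = self.q(x_self)                     # queries from the current modality
        k_intra = self.k(x_self) + self.noise  # intra-modal key with layer noise
        k_inter = self.k(x_other)              # inter-modal key
        v_intra = self.v(x_self)
        v_inter = self.v(x_other)

        # Per-token attention over exactly two candidates -> linear complexity.
        logits = torch.stack(
            [(q * k_intra).sum(-1), (q * k_inter).sum(-1)], dim=-1
        ) * self.scale                          # (B, N, 2)
        attn = logits.softmax(dim=-1)
        fused = attn[..., 0:1] * v_intra + attn[..., 1:2] * v_inter
        return x_self + self.proj(fused)        # residual connection

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor):
        # Symmetric fusion: each modality is updated with the other's features.
        return self.fuse(x_a, x_b), self.fuse(x_b, x_a)


if __name__ == "__main__":
    block = PixelwiseGeminiFusion(dim=64)
    rgb = torch.randn(2, 196, 64)    # e.g. RGB tokens
    depth = torch.randn(2, 196, 64)  # e.g. depth tokens
    fused_rgb, fused_depth = block(rgb, depth)
    print(fused_rgb.shape, fused_depth.shape)  # torch.Size([2, 196, 64]) twice
```

Because each query compares against only two keys, the block scales linearly with the token count, which is why such a design can be dropped into existing unimodal backbones without a significant increase in latency.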