GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer

2024 | Ding Jia, Jianyuan Guo, Kai Han, Han Wu, Chao Zhang, Chang Xu, Xinghao Chen
GeminiFusion is a pixel-wise multimodal fusion approach for Vision Transformers that efficiently integrates information across different modalities. The paper critiques prior token exchange methods, which replace less informative tokens with inter-modal features, and shows that cross-attention mechanisms outperform them but are computationally expensive. To address this, GeminiFusion dynamically integrates complementary information across modalities by combining intra-modal and inter-modal attention, using layer-adaptive noise to control the interplay of features on a per-layer basis. The fusion has linear complexity with respect to the number of input tokens, keeping efficiency comparable to unimodal networks.

Comprehensive evaluations across multimodal image-to-image translation, 3D object detection, and arbitrary-modal semantic segmentation show that GeminiFusion outperforms leading techniques while reducing computational overhead. The method is designed to be plug-and-play, allowing seamless integration into various vision backbones; the paper also demonstrates its effectiveness in other frameworks, including Swin Transformer, and its ability to handle highly aligned modalities. Experiments show state-of-the-art performance on several multimodal semantic segmentation benchmarks, strong results in image-to-image translation and 3D object detection, and better accuracy and inference latency than TokenFusion. The paper concludes that GeminiFusion is a promising approach for multimodal fusion, offering a balance between efficiency and effectiveness.
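To make the fusion idea concrete, here is a minimal PyTorch sketch of a pixel-wise fusion block that combines intra-modal and inter-modal attention with a learnable per-layer noise term. This is an illustrative assumption about how the description maps to code, not the authors' released implementation; names such as `PixelwiseGeminiFusion` and `noise` are hypothetical. The key property shown is that each token attends only to itself and to the token at the same spatial position in the other modality, which keeps the cost linear in the number of tokens.

```python
import torch
import torch.nn as nn


class PixelwiseGeminiFusion(nn.Module):
    """Hypothetical sketch of pixel-wise two-modality fusion.

    Each token of modality A attends only to its own token and the token at the
    same position in modality B (and vice versa), so the cost is linear in the
    number of tokens, unlike full cross-attention over all token pairs.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)
        # Layer-adaptive noise: a learnable per-layer vector added to the
        # intra-modal key, modulating how strongly each token relies on itself
        # versus the other modality (an assumption about how the idea maps to code).
        self.noise = nn.Parameter(torch.zeros(dim))
        self.scale = dim ** -0.5

    def fuse(self, x_self: torch.Tensor, x_other: torch.Tensor) -> torch.Tensor:
        # x_self, x_other: (B, N, C) token sequences from the two modalities.
        q = self.q(x_self)                     # queries from the current modality
        k_intra = self.k(x_self) + self.noise  # intra-modal key with layer noise
        k_inter = self.k(x_other)              # inter-modal key
        v_intra = self.v(x_self)
        v_inter = self.v(x_other)

        # Per-token attention over exactly two candidates -> linear complexity.
        logits = torch.stack(
            [(q * k_intra).sum(-1), (q * k_inter).sum(-1)], dim=-1
        ) * self.scale                          # (B, N, 2)
        attn = logits.softmax(dim=-1)
        fused = attn[..., 0:1] * v_intra + attn[..., 1:2] * v_inter
        return x_self + self.proj(fused)        # residual connection

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor):
        # Symmetric fusion: each modality is updated with the other's features.
        return self.fuse(x_a, x_b), self.fuse(x_b, x_a)


if __name__ == "__main__":
    block = PixelwiseGeminiFusion(dim=64)
    rgb = torch.randn(2, 196, 64)    # e.g. RGB tokens
    depth = torch.randn(2, 196, 64)  # e.g. depth tokens
    fused_rgb, fused_depth = block(rgb, depth)
    print(fused_rgb.shape, fused_depth.shape)  # torch.Size([2, 196, 64]) twice
```

Because each query compares against only two keys, the block scales linearly with the token count, which is why such a design can be dropped into existing unimodal backbones without a significant increase in latency.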