22 Mar 2024 | Junbo Yin, Jianbing Shen, Runnan Chen, Wei Li, Ruigang Yang, Pascal Frossard, Wenguan Wang
**IS-FUSION: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection**
This paper introduces IS-FUSION, a multimodal fusion framework designed to enhance 3D object detection in autonomous driving scenarios. The framework addresses a limitation of conventional bird's eye view (BEV) representations: objects occupy only a small region of the BEV grid and the surrounding point cloud is sparse, which makes reliable 3D perception difficult. IS-FUSION integrates instance-level and scene-level contextual information through two key modules: Hierarchical Scene Fusion (HSF) and Instance-Guided Fusion (IGF).
1. **Hierarchical Scene Fusion (HSF)**: HSF captures multimodal scene context at multiple granularities using Point-to-Grid and Grid-to-Region transformers, integrating point-, grid-, and region-level features into a richer scene representation.
2. **Instance-Guided Fusion (IGF)**: IGF mines instance candidates, explores their relationships, and aggregates local multimodal context for each instance. These instances then guide the enhancement of the scene feature, improving the overall BEV representation (a minimal sketch of both modules follows this list).
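Since the paper is summarized here without code, the following PyTorch-style sketch shows one way modules like HSF and IGF could compose. All class names, tensor shapes, and the top-k candidate selection are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HierarchicalSceneFusion(nn.Module):
    """Illustrative stand-in for HSF: fuses LiDAR and image BEV features,
    then applies self-attention for scene-level context. (The paper's HSF
    additionally uses Point-to-Grid and Grid-to-Region transformers.)"""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, lidar_bev: torch.Tensor, img_bev: torch.Tensor) -> torch.Tensor:
        # lidar_bev, img_bev: (B, N, C) flattened BEV grids per modality
        fused = self.proj(torch.cat([lidar_bev, img_bev], dim=-1))
        out, _ = self.attn(fused, fused, fused)
        return out

class InstanceGuidedFusion(nn.Module):
    """Illustrative stand-in for IGF: scores BEV cells, keeps the top-k as
    instance candidates, models their relations, and lets them guide the
    scene feature via cross-attention."""
    def __init__(self, dim: int, k: int = 200, heads: int = 4):
        super().__init__()
        self.k = k
        self.score = nn.Linear(dim, 1)
        self.inst_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scene_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, scene: torch.Tensor) -> torch.Tensor:
        # scene: (B, N, C) scene-level BEV feature from HSF
        scores = self.score(scene).squeeze(-1)               # (B, N)
        idx = scores.topk(self.k, dim=1).indices             # candidate cells
        inst = torch.gather(
            scene, 1, idx.unsqueeze(-1).expand(-1, -1, scene.size(-1)))
        inst, _ = self.inst_attn(inst, inst, inst)           # instance relations
        guided, _ = self.scene_attn(scene, inst, inst)       # instances guide scene
        return scene + guided                                # instance-aware BEV
```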
**Key Contributions:**
- **Instance-Level and Scene-Level Context Integration**: IS-FUSION explicitly promotes collaboration between instance and scene features, ensuring comprehensive representation and improved detection performance.
- **Enhanced BEV Representation**: The framework yields an instance-aware BEV representation, which is better suited for instance-centric tasks like 3D object detection.
- **Superior Performance**: On the nuScenes benchmark, IS-FUSION outperforms all published multimodal works, achieving 72.8% mAP on the validation set, surpassing prior art by up to 4.3% mAP.
**Methodology:**
- **Input Encoding**: Modality-specific encoders (VoxelNet for LiDAR and Swin-Transformer for images) are used to obtain initial representations.
- **Multimodal Encoder**: The multimodal encoder combines point cloud and image features using HSF and IGF modules.
- **Multimodal Decoder**: The decoder generates final 3D detections from the enhanced BEV representation (a hypothetical end-to-end wiring is sketched below).
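As a reading aid, here is a hypothetical end-to-end forward pass wiring these three stages together; the callable names and the BEV projection step are assumptions based on the paper's description, not its actual code.

```python
def detect_3d(points, images, lidar_encoder, image_encoder,
              bev_projector, hsf, igf, decoder):
    """Hypothetical IS-FUSION-style forward pass (wiring is an assumption)."""
    lidar_bev = lidar_encoder(points)    # VoxelNet-style LiDAR BEV feature
    img_feat = image_encoder(images)     # Swin Transformer image features
    img_bev = bev_projector(img_feat)    # lift image features into the BEV plane
    scene = hsf(lidar_bev, img_bev)      # Hierarchical Scene Fusion
    scene = igf(scene)                   # Instance-Guided Fusion
    return decoder(scene)                # 3D boxes from the enhanced BEV feature
```

Passing the encoders and fusion modules as callables keeps the sketch agnostic to the concrete backbones.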
**Experiments:**
- **Dataset**: nuScenes, a large-scale autonomous driving dataset.
- **Network Architecture**: Implementation follows MMDetection3D, with specific hyperparameters for point cloud and image encoders.
- **Training**: End-to-end training with the AdamW optimizer, a one-cycle learning-rate policy, and cross-modal data augmentation (see the sketch below).
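The optimizer and schedule are standard PyTorch components, so the reported recipe can be sketched directly; the learning rate, weight decay, and step counts below are placeholders, not the paper's values.

```python
import torch

model = torch.nn.Linear(256, 10)  # stand-in for the detector
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
# One-cycle policy: warm the learning rate up to max_lr, then anneal it.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, total_steps=1000)

for step in range(1000):
    loss = model(torch.randn(8, 256)).sum()  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```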
**Ablation Studies:**
- **Component-wise Ablation**: Evaluates the impact of each module and hyperparameters.
- **HSF Analysis**: Shows the effectiveness of different feature granularities.
- **IGF Analysis**: Demonstrates the importance of instance-level modeling and identifies effective hyperparameter settings.
**Conclusion:**
IS-FUSION provides a fresh perspective on BEV-based perception models by emphasizing instance-level context, offering significant improvements in multimodal 3D object detection for autonomous driving.