22 Mar 2024 | Junbo Yin, Jianbing Shen, Runnan Chen, Wei Li, Ruigang Yang, Pascal Frossard, Wenguang Wang
IS-FUSION is a novel multimodal fusion framework for 3D object detection that jointly captures instance-level and scene-level contextual information. Unlike existing methods that focus solely on scene-level fusion, IS-FUSION explicitly incorporates instance-level multimodal information, benefiting instance-centric tasks such as 3D object detection. The framework consists of two key modules: the Hierarchical Scene Fusion (HSF) module and the Instance-Guided Fusion (IGF) module. HSF uses Point-to-Grid and Grid-to-Region transformers to capture scene context at different granularities, while IGF mines instance candidates, explores their relationships, and aggregates local multimodal context for each instance. These instances then guide the scene features to enhance the BEV representation. On the challenging nuScenes benchmark, IS-FUSION outperforms all published multimodal works, achieving 72.8% mAP on the validation set. Extensive experiments demonstrate superior detection accuracy and inference speed across diverse object categories and challenging scenarios. By emphasizing instance-level context, IS-FUSION offers a fresh perspective for BEV-based perception models that benefits a wide range of instance-centric tasks.
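To make the HSF-then-IGF data flow concrete, here is a minimal plain-Python sketch of the pipeline structure. All names (`point_to_grid`, `grid_to_region`, `instance_guided_fusion`), the grid size, and the averaging/top-k stand-ins for the actual transformer blocks are illustrative assumptions for exposition; the real modules are learned attention networks, not these toy reductions.

```python
def point_to_grid(points, grid_size=2):
    """Stand-in for the Point-to-Grid transformer: group (x, y, feature)
    triples into BEV grid cells and average each cell's features."""
    grid = {}
    for x, y, feat in points:
        cell = (int(x) // grid_size, int(y) // grid_size)
        grid.setdefault(cell, []).append(feat)
    return {cell: sum(f) / len(f) for cell, f in grid.items()}

def grid_to_region(grid):
    """Stand-in for Grid-to-Region attention: pool all grid cells into
    a single scene-level context value."""
    return sum(grid.values()) / len(grid)

def instance_guided_fusion(grid, region_ctx, top_k=2):
    """Stand-in for IGF: take the top-k strongest cells as instance
    candidates, then blend each candidate with the scene-level context
    to produce an enhanced BEV feature per instance."""
    candidates = sorted(grid.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return {cell: 0.5 * feat + 0.5 * region_ctx for cell, feat in candidates}

# Toy run: triples standing in for fused LiDAR/camera point features.
points = [(0, 0, 1.0), (1, 1, 3.0), (4, 4, 5.0), (5, 5, 7.0)]
grid = point_to_grid(points)                       # scene fusion (HSF)
enhanced = instance_guided_fusion(grid, grid_to_region(grid))  # instance fusion (IGF)
```

The point of the sketch is only the ordering: scene-level fusion builds the BEV grid first, and instance-level fusion then reads candidates out of that grid and feeds instance context back into it.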