17 Jul 2024 | Chenming Zhu, Tai Wang, Wenwei Zhang, Kai Chen, and Xihui Liu
This paper introduces a new task called 3D reasoning grounding, which requires a model to reason over complex, implicit human instructions, localize the target objects in a 3D scene, and provide explanations. To evaluate this task, we introduce a new benchmark, ScanReason, containing over 10,000 question-answer-location pairs spanning five reasoning types.

We propose ReGround3D, an approach that combines a visual-centric reasoning module powered by a multi-modal large language model (MLLM) with a 3D grounding module that accurately localizes objects in 3D scenes. The reasoning module conducts joint reasoning over the 3D scene and the instruction, predicting a special token whose embedding captures the semantics and location of the target object. The grounding module then uses this token embedding to look back at a fine-grained 3D scene representation and localize the target object. A chain-of-grounding mechanism synergizes the two modules, allowing multiple rounds of alternating reasoning and grounding during inference to further improve performance.

Our contributions are: (1) proposing the 3D reasoning grounding task, (2) introducing the ScanReason benchmark, and (3) designing the ReGround3D framework with a visual-centric reasoning module and a 3D grounding module with geometry-enhanced look-back. Extensive experiments on the ScanReason benchmark validate the effectiveness of our approach.
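To make the reasoning-grounding interplay concrete, below is a minimal sketch of what a chain-of-grounding inference loop could look like. All names here (ReGround3DPipeline, the scene encoder, the `<LOC>` token, the `done` flag) are hypothetical placeholders for illustration, not the paper's published API; the actual ReGround3D implementation may differ substantially.

```python
# Hypothetical sketch of a chain-of-grounding inference loop, based on the
# paper's description: an MLLM reasons over scene tokens and emits a special
# token whose embedding is used by a grounding module to predict a 3D box,
# with multiple alternating rounds of reasoning and grounding.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BBox3D:
    center: Tuple[float, float, float]  # (x, y, z) box center in scene coordinates
    size: Tuple[float, float, float]    # (dx, dy, dz) box extents


class ReGround3DPipeline:
    def __init__(self, reasoning_mllm, grounding_module, scene_encoder, max_rounds: int = 3):
        self.reasoning_mllm = reasoning_mllm      # visual-centric reasoning module (MLLM)
        self.grounding_module = grounding_module  # 3D grounding module with geometry-enhanced look-back
        self.scene_encoder = scene_encoder        # produces coarse scene tokens + fine-grained 3D features
        self.max_rounds = max_rounds

    def infer(self, point_cloud, instruction: str) -> List[BBox3D]:
        # Encode the scene once: coarse visual tokens feed the MLLM, while
        # fine-grained geometric features are kept for the grounding "look-back".
        coarse_tokens, fine_features = self.scene_encoder(point_cloud)

        boxes: List[BBox3D] = []
        context = instruction
        for _ in range(self.max_rounds):
            # 1) Joint reasoning over scene tokens and the (possibly updated) instruction.
            #    The MLLM emits explanatory text plus a special <LOC> token whose hidden
            #    embedding summarizes the semantics/location of the target object.
            text, loc_embedding, done = self.reasoning_mllm(coarse_tokens, context, boxes)

            # 2) Grounding: query the fine-grained 3D representation with the <LOC>
            #    embedding and predict a 3D bounding box for the target object.
            box = self.grounding_module(loc_embedding, fine_features)
            boxes.append(box)

            if done:
                # The MLLM signals that no further grounding rounds are needed.
                break

            # 3) Feed the grounded result back into the next reasoning round.
            context = f"{context}\nGrounded so far: {boxes}\n{text}"
        return boxes
```

Under these assumptions, the key design choice is that reasoning and grounding stay separate but communicate through the `<LOC>` token embedding, so each grounding step can revisit fine-grained geometry that the coarse MLLM tokens would otherwise lose.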