17 Jul 2024 | Chenming Zhu, Tai Wang, Wenwei Zhang, Kai Chen, and Xihui Liu
This paper introduces a new task called 3D reasoning grounding, which requires a model to reason over complex, implicit human instructions, localize the target objects in a 3D scene, and provide explanations. To evaluate this task, we introduce a new benchmark, ScanReason, containing over 10,000 question-answer-location pairs spanning five reasoning types.

We propose ReGround3D, an approach that combines a visual-centric reasoning module powered by a multi-modal large language model (MLLM) with a 3D grounding module that accurately localizes objects in 3D scenes. The reasoning module conducts joint reasoning over the 3D scene and the instruction, predicting a special token whose embedding captures the semantics and location of the target object. The grounding module then uses this token embedding to look back at a fine-grained 3D scene representation and localize the target object. A chain-of-grounding mechanism synergizes the two modules, allowing multiple rounds of alternating reasoning and grounding during inference to further improve performance.

Our contributions are: (1) proposing the 3D reasoning grounding task, (2) introducing the ScanReason benchmark, and (3) designing the ReGround3D framework with a visual-centric reasoning module and a 3D grounding module with geometry-enhanced look-back. Extensive experiments on the ScanReason benchmark validate the effectiveness of our approach.
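To make the reasoning-grounding interplay concrete, below is a minimal sketch of what a chain-of-grounding inference loop could look like. All names here (ReGround3DPipeline, the scene encoder, the `<LOC>` token, the `done` flag) are hypothetical placeholders for illustration, not the paper's published API; the actual ReGround3D implementation may differ substantially.

```python
# Hypothetical sketch of a chain-of-grounding inference loop, based on the
# paper's description: an MLLM reasons over scene tokens and emits a special
# token whose embedding is used by a grounding module to predict a 3D box,
# with multiple alternating rounds of reasoning and grounding.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BBox3D:
    center: Tuple[float, float, float]  # (x, y, z) box center in scene coordinates
    size: Tuple[float, float, float]    # (dx, dy, dz) box extents


class ReGround3DPipeline:
    def __init__(self, reasoning_mllm, grounding_module, scene_encoder, max_rounds: int = 3):
        self.reasoning_mllm = reasoning_mllm      # visual-centric reasoning module (MLLM)
        self.grounding_module = grounding_module  # 3D grounding module with geometry-enhanced look-back
        self.scene_encoder = scene_encoder        # produces coarse scene tokens + fine-grained 3D features
        self.max_rounds = max_rounds

    def infer(self, point_cloud, instruction: str) -> List[BBox3D]:
        # Encode the scene once: coarse visual tokens feed the MLLM, while
        # fine-grained geometric features are kept for the grounding "look-back".
        coarse_tokens, fine_features = self.scene_encoder(point_cloud)

        boxes: List[BBox3D] = []
        context = instruction
        for _ in range(self.max_rounds):
            # 1) Joint reasoning over scene tokens and the (possibly updated) instruction.
            #    The MLLM emits explanatory text plus a special <LOC> token whose hidden
            #    embedding summarizes the semantics/location of the target object.
            text, loc_embedding, done = self.reasoning_mllm(coarse_tokens, context, boxes)

            # 2) Grounding: query the fine-grained 3D representation with the <LOC>
            #    embedding and predict a 3D bounding box for the target object.
            box = self.grounding_module(loc_embedding, fine_features)
            boxes.append(box)

            if done:
                # The MLLM signals that no further grounding rounds are needed.
                break

            # 3) Feed the grounded result back into the next reasoning round.
            context = f"{context}\nGrounded so far: {boxes}\n{text}"
        return boxes
```

Under these assumptions, the key design choice is that reasoning and grounding stay separate but communicate through the `<LOC>` token embedding, so each grounding step can revisit fine-grained geometry that the coarse MLLM tokens would otherwise lose.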