22 Mar 2024 | Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wenhan Xiong
This paper introduces Scene-LLM, a 3D-visual-language model designed to enhance embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs). Scene-LLM employs a hybrid 3D visual feature representation that incorporates dense spatial information and supports scene state updates. The model uses a projection layer to efficiently project these features into the pre-trained textual embedding space, enabling effective interpretation of 3D visual information. A unique aspect of Scene-LLM is its integration of both scene-level and egocentric 3D information, which is crucial for interactive planning. Egocentric 3D frame features are used for feature alignment, an efficient technique that incorporates fine-grained concepts. Experiments demonstrate Scene-LLM's strong capabilities in dense captioning, question answering, and interactive planning, advancing the field of 3D visual understanding and reasoning. The paper also presents a large-scale dataset for 3D and text feature alignment, comprising $190k$ 3D-visual-language pairs from an egocentric viewpoint and about $500k$ pairs of scene-level data.
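To make the projection idea concrete, the sketch below shows how 3D visual features could be mapped into an LLM's textual embedding space with a single projection layer, as the abstract describes. The module name, feature dimensions, and token counts are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class SceneFeatureProjector(nn.Module):
    """Hypothetical sketch: project hybrid 3D visual features (e.g. pooled
    per-voxel or per-point features) into the LLM's pre-trained textual
    embedding space, so they can be consumed as soft visual tokens
    alongside ordinary word embeddings."""

    def __init__(self, visual_dim: int = 1024, llm_embed_dim: int = 4096):
        super().__init__()
        # A single linear projection layer; the exact dimensions and any
        # additional normalization are assumptions for illustration only.
        self.proj = nn.Linear(visual_dim, llm_embed_dim)

    def forward(self, scene_feats: torch.Tensor) -> torch.Tensor:
        # scene_feats: (batch, num_visual_tokens, visual_dim)
        # returns:     (batch, num_visual_tokens, llm_embed_dim)
        return self.proj(scene_feats)


if __name__ == "__main__":
    projector = SceneFeatureProjector()
    feats = torch.randn(1, 256, 1024)   # e.g. 256 scene-level feature tokens
    visual_tokens = projector(feats)    # ready to prepend to text embeddings
    print(visual_tokens.shape)          # torch.Size([1, 256, 4096])
```

In this sketch, the projected visual tokens would simply be concatenated in front of the text token embeddings before being fed to the LLM; whether Scene-LLM uses extra normalization, multiple projection layers, or a different interleaving scheme is not specified in the abstract.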