Volumetric Environment Representation for Vision-Language Navigation

21 Mar 2024 | Rui Liu, Wenguan Wang, Yi Yang
This paper proposes a Volumetric Environment Representation (VER) for Vision-Language Navigation (VLN). VLN requires an agent to navigate a 3D environment by following natural language instructions and interpreting visual observations, and the key challenge is comprehensive scene understanding. Previous methods rely on 2D features, which fail to capture 3D geometry and semantics and therefore yield incomplete environment representations.

To address this, the authors voxelize the physical world into structured 3D cells. Each cell aggregates multi-view 2D features into a unified 3D space via 2D-3D sampling. VER is trained with multi-task learning on 3D perception tasks, so the agent jointly predicts 3D occupancy, room layout, and 3D object bounding boxes from the same representation.
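The summary does not include reference code, but the 2D-3D sampling step can be pictured as projecting each voxel center into every camera view and pooling the 2D features it lands on. The sketch below is a minimal illustration assuming pinhole intrinsics/extrinsics and PyTorch feature maps; the function name, visibility test, and simple view-averaging are assumptions for illustration, not taken from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sample_2d_features_into_voxels(feat_maps, voxel_centers, intrinsics, extrinsics):
    """Aggregate multi-view 2D features into voxel cells (2D-3D sampling sketch).

    feat_maps:     (V, C, H, W)  per-view 2D feature maps
    voxel_centers: (N, 3)        voxel centers in world coordinates
    intrinsics:    (V, 3, 3)     pinhole camera intrinsics
    extrinsics:    (V, 4, 4)     world-to-camera transforms
    returns:       (N, C)        per-voxel features averaged over the views that see them
    """
    V, C, H, W = feat_maps.shape
    N = voxel_centers.shape[0]
    homog = torch.cat([voxel_centers, torch.ones(N, 1)], dim=1)            # (N, 4) homogeneous coords

    acc = torch.zeros(N, C)
    hits = torch.zeros(N, 1)
    for v in range(V):
        cam = (extrinsics[v] @ homog.T).T[:, :3]                           # voxel centers in camera frame
        in_front = cam[:, 2] > 1e-3                                        # keep points in front of the camera
        pix = (intrinsics[v] @ cam.T).T
        pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-3)                     # perspective divide to pixel coords
        # normalize pixel coordinates to [-1, 1] for grid_sample
        x_n = 2.0 * pix[:, 0] / (W - 1) - 1.0
        y_n = 2.0 * pix[:, 1] / (H - 1) - 1.0
        grid = torch.stack([x_n, y_n], dim=-1).view(1, N, 1, 2)
        sampled = F.grid_sample(feat_maps[v:v + 1], grid, align_corners=True)  # (1, C, N, 1) bilinear samples
        sampled = sampled.squeeze(0).squeeze(-1).T                         # (N, C)
        visible = in_front & (x_n.abs() <= 1) & (y_n.abs() <= 1)
        acc[visible] += sampled[visible]
        hits[visible] += 1
    return acc / hits.clamp(min=1)                                         # average over visible views
```

Each voxel then carries a view-aggregated feature that shared prediction heads (occupancy, room layout, 3D boxes) can consume under the multi-task objective described above.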
On top of VER, the agent performs volume state estimation and builds an episodic memory over visited locations. Local action probabilities derived from the current volume state are combined with global action probabilities from the episodic memory to decide the next step.

The method is evaluated on three VLN benchmarks: R2R, REVERIE, and R4R. Environment representations learned with multi-task supervision lead to significant performance gains, and the model achieves state-of-the-art results across all three benchmarks, demonstrating that a structured volumetric representation benefits both 3D perception and navigation accuracy and efficiency.
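The local/global decision fusion can likewise be sketched as mixing two distributions over navigable nodes. The fixed mixing weight, the `local_to_global` index map, and the renormalization below are illustrative assumptions; the paper's actual fusion mechanism may differ (e.g., it may be learned).

```python
import torch

def fuse_action_distributions(local_logits, global_logits, local_to_global, fuse_weight=0.5):
    """Combine local (volume-state) and global (episodic-memory) action scores.

    local_logits:    (K,) scores over the K locally visible candidate viewpoints
    global_logits:   (M,) scores over all M nodes in the episodic memory graph
    local_to_global: (K,) index of each local candidate within the global node set
    returns:         (M,) fused action distribution; argmax gives the next node
    """
    p_local = torch.softmax(local_logits, dim=0)
    p_global = torch.softmax(global_logits, dim=0)
    # Lift the local distribution onto the global node set, then mix the two.
    p_local_full = torch.zeros_like(p_global)
    p_local_full[local_to_global] = p_local
    p_fused = fuse_weight * p_local_full + (1.0 - fuse_weight) * p_global
    return p_fused / p_fused.sum()
```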