Volumetric Environment Representation for Vision-Language Navigation

21 Mar 2024 | Rui Liu, Wenguan Wang, Yi Yang
This paper proposes a Volumetric Environment Representation (VER) for Vision-Language Navigation (VLN). VLN requires an agent to navigate a 3D environment by following natural language instructions and interpreting visual observations, and the key challenge is comprehensive scene understanding. Previous methods rely on 2D features, which fail to capture 3D geometry and semantics and therefore yield incomplete environment representations.

To address this, the authors voxelize the physical world into structured 3D cells. Each cell aggregates multi-view 2D features into a unified 3D space via 2D-3D sampling. VER is trained with multi-task learning on 3D perception tasks, so the agent jointly predicts 3D occupancy, room layout, and 3D object bounding boxes from the same representation.
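The summary does not include reference code, but the 2D-3D sampling step can be pictured as projecting each voxel center into every camera view and pooling the 2D features it lands on. The sketch below is a minimal illustration assuming pinhole intrinsics/extrinsics and PyTorch feature maps; the function name, visibility test, and simple view-averaging are assumptions for illustration, not taken from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sample_2d_features_into_voxels(feat_maps, voxel_centers, intrinsics, extrinsics):
    """Aggregate multi-view 2D features into voxel cells (2D-3D sampling sketch).

    feat_maps:     (V, C, H, W)  per-view 2D feature maps
    voxel_centers: (N, 3)        voxel centers in world coordinates
    intrinsics:    (V, 3, 3)     pinhole camera intrinsics
    extrinsics:    (V, 4, 4)     world-to-camera transforms
    returns:       (N, C)        per-voxel features averaged over the views that see them
    """
    V, C, H, W = feat_maps.shape
    N = voxel_centers.shape[0]
    homog = torch.cat([voxel_centers, torch.ones(N, 1)], dim=1)            # (N, 4) homogeneous coords

    acc = torch.zeros(N, C)
    hits = torch.zeros(N, 1)
    for v in range(V):
        cam = (extrinsics[v] @ homog.T).T[:, :3]                           # voxel centers in camera frame
        in_front = cam[:, 2] > 1e-3                                        # keep points in front of the camera
        pix = (intrinsics[v] @ cam.T).T
        pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-3)                     # perspective divide to pixel coords
        # normalize pixel coordinates to [-1, 1] for grid_sample
        x_n = 2.0 * pix[:, 0] / (W - 1) - 1.0
        y_n = 2.0 * pix[:, 1] / (H - 1) - 1.0
        grid = torch.stack([x_n, y_n], dim=-1).view(1, N, 1, 2)
        sampled = F.grid_sample(feat_maps[v:v + 1], grid, align_corners=True)  # (1, C, N, 1) bilinear samples
        sampled = sampled.squeeze(0).squeeze(-1).T                         # (N, C)
        visible = in_front & (x_n.abs() <= 1) & (y_n.abs() <= 1)
        acc[visible] += sampled[visible]
        hits[visible] += 1
    return acc / hits.clamp(min=1)                                         # average over visible views
```

Each voxel then carries a view-aggregated feature that shared prediction heads (occupancy, room layout, 3D boxes) can consume under the multi-task objective described above.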
On top of VER, the agent performs volume state estimation and builds an episodic memory over visited locations. Local action probabilities derived from the current volume state are combined with global action probabilities from the episodic memory to decide the next step.

The method is evaluated on three VLN benchmarks: R2R, REVERIE, and R4R. Environment representations learned with multi-task supervision lead to significant performance gains, and the model achieves state-of-the-art results across all three benchmarks, demonstrating that a structured volumetric representation benefits both 3D perception and navigation accuracy and efficiency.
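The local/global decision fusion can likewise be sketched as mixing two distributions over navigable nodes. The fixed mixing weight, the `local_to_global` index map, and the renormalization below are illustrative assumptions; the paper's actual fusion mechanism may differ (e.g., it may be learned).

```python
import torch

def fuse_action_distributions(local_logits, global_logits, local_to_global, fuse_weight=0.5):
    """Combine local (volume-state) and global (episodic-memory) action scores.

    local_logits:    (K,) scores over the K locally visible candidate viewpoints
    global_logits:   (M,) scores over all M nodes in the episodic memory graph
    local_to_global: (K,) index of each local candidate within the global node set
    returns:         (M,) fused action distribution; argmax gives the next node
    """
    p_local = torch.softmax(local_logits, dim=0)
    p_global = torch.softmax(global_logits, dim=0)
    # Lift the local distribution onto the global node set, then mix the two.
    p_local_full = torch.zeros_like(p_global)
    p_local_full[local_to_global] = p_local
    p_fused = fuse_weight * p_local_full + (1.0 - fuse_weight) * p_global
    return p_fused / p_fused.sum()
```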