Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation


2 Apr 2024 | Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, Junjie Hu, Ming Jiang, Shuqiang Jiang
This paper proposes a lookahead exploration method for continuous vision-language navigation (VLN). Its hierarchical neural radiance (HNR) representation model predicts multi-level semantic features for future, as-yet-unvisited environments, which is more robust and efficient than pixel-wise RGB reconstruction.

The HNR model uses a pre-trained vision-language embedding model (CLIP) to compress redundant visual information and extract the critical semantics of each view. Observed environments are encoded into a feature cloud, and volume rendering combined with hierarchical encoding predicts the semantic representation of the environment at novel future viewpoints (a minimal rendering sketch follows below).

These predicted representations are then used to construct a navigable future path tree. The lookahead VLN model evaluates the candidate future paths in this tree via efficient parallel evaluation and selects the optimal candidate location (see the scoring sketch below).

The method is evaluated on the R2R-CE and RxR-CE benchmarks, achieving state-of-the-art performance on most metrics, including success rate (SR) and success weighted by path length (SPL). It also handles visual occlusions better and predicts future environments more accurately than prior methods, making it an efficient and effective approach for continuous VLN, with broader research potential for embodied AI tasks.
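To make the volume-rendering step concrete, here is a minimal sketch assuming standard NeRF-style compositing, applied to per-sample semantic features instead of RGB values. The function name, tensor layout, and inputs are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch (not the authors' code) of compositing CLIP-derived
# semantic features along a ray with volume rendering.

import torch

def render_ray_feature(point_feats, densities, deltas):
    """Composite per-sample semantic features along one ray.

    point_feats: (S, D) features sampled along the ray from the feature cloud
    densities:   (S,)   predicted volume densities at each sample
    deltas:      (S,)   distances between consecutive samples
    """
    # Per-sample opacity from density and step size.
    alphas = 1.0 - torch.exp(-densities * deltas)
    # Accumulated transmittance: probability the ray reaches each sample.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:1]), 1.0 - alphas + 1e-10])[:-1],
        dim=0,
    )
    # Each sample's contribution weight, then the weighted feature sum.
    weights = alphas * trans
    return (weights.unsqueeze(-1) * point_feats).sum(dim=0)  # (D,)
```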
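The parallel path evaluation can be sketched in the same spirit. The example below assumes each candidate future path has already been rendered into per-node features, and scores all paths in one batch against a text embedding of the instruction (e.g., from CLIP's text encoder). The cosine-similarity scoring and all names are hypothetical simplifications of the paper's learned evaluator.

```python
# A hypothetical sketch of the lookahead scoring step: pool each candidate
# path's rendered node features and pick the path best matching the
# instruction embedding.

import torch
import torch.nn.functional as F

def select_best_path(path_node_feats, instr_emb):
    """path_node_feats: list of (N_i, D) rendered features per candidate path
       instr_emb:       (D,) instruction embedding
    """
    # Mean-pool each path's node features into one vector per path.
    pooled = torch.stack([f.mean(dim=0) for f in path_node_feats])  # (P, D)
    # Score every path against the instruction in a single batched call.
    scores = F.cosine_similarity(pooled, instr_emb.unsqueeze(0), dim=-1)
    return int(scores.argmax())
```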