Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation


2 Apr 2024 | Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, Junjie Hu, Ming Jiang, Shuqiang Jiang
This paper proposes a lookahead exploration method for continuous vision-language navigation (VLN). Its hierarchical neural radiance (HNR) representation model predicts multi-level semantic features for future, as-yet-unvisited environments, which is more robust and efficient than pixel-wise RGB reconstruction.

The HNR model uses a pre-trained vision-language embedding model (CLIP) to compress redundant visual information and extract the critical semantics of each view. Observed environments are encoded into a feature cloud, and volume rendering combined with hierarchical encoding predicts the semantic representation of the environment at novel future viewpoints (a minimal rendering sketch follows below).

These predicted representations are then used to construct a navigable future path tree. The lookahead VLN model evaluates the candidate future paths in this tree via efficient parallel evaluation and selects the optimal candidate location (see the scoring sketch below).

The method is evaluated on the R2R-CE and RxR-CE benchmarks, achieving state-of-the-art performance on most metrics, including success rate (SR) and success weighted by path length (SPL). It also handles visual occlusions better and predicts future environments more accurately than prior methods, making it an efficient and effective approach for continuous VLN, with broader research potential for embodied AI tasks.
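To make the volume-rendering step concrete, here is a minimal sketch assuming standard NeRF-style compositing, applied to per-sample semantic features instead of RGB values. The function name, tensor layout, and inputs are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch (not the authors' code) of compositing CLIP-derived
# semantic features along a ray with volume rendering.

import torch

def render_ray_feature(point_feats, densities, deltas):
    """Composite per-sample semantic features along one ray.

    point_feats: (S, D) features sampled along the ray from the feature cloud
    densities:   (S,)   predicted volume densities at each sample
    deltas:      (S,)   distances between consecutive samples
    """
    # Per-sample opacity from density and step size.
    alphas = 1.0 - torch.exp(-densities * deltas)
    # Accumulated transmittance: probability the ray reaches each sample.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:1]), 1.0 - alphas + 1e-10])[:-1],
        dim=0,
    )
    # Each sample's contribution weight, then the weighted feature sum.
    weights = alphas * trans
    return (weights.unsqueeze(-1) * point_feats).sum(dim=0)  # (D,)
```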
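The parallel path evaluation can be sketched in the same spirit. The example below assumes each candidate future path has already been rendered into per-node features, and scores all paths in one batch against a text embedding of the instruction (e.g., from CLIP's text encoder). The cosine-similarity scoring and all names are hypothetical simplifications of the paper's learned evaluator.

```python
# A hypothetical sketch of the lookahead scoring step: pool each candidate
# path's rendered node features and pick the path best matching the
# instruction embedding.

import torch
import torch.nn.functional as F

def select_best_path(path_node_feats, instr_emb):
    """path_node_feats: list of (N_i, D) rendered features per candidate path
       instr_emb:       (D,) instruction embedding
    """
    # Mean-pool each path's node features into one vector per path.
    pooled = torch.stack([f.mean(dim=0) for f in path_node_feats])  # (P, D)
    # Score every path against the instruction in a single batched call.
    scores = F.cosine_similarity(pooled, instr_emb.unsqueeze(0), dim=-1)
    return int(scores.argmax())
```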