30 Jun 2024 | Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, He Wang
NaVid is a video-based large vision-language model (VLM) designed for vision-and-language navigation (VLN) in continuous environments. It enables agents to follow free-form linguistic instructions in unseen environments without maps, odometers, or depth inputs: the model takes only the RGB video stream from a monocular camera and outputs the next-step action, mimicking how humans navigate and sidestepping issues such as odometer noise and the Sim-to-Real gap introduced by extra sensors. Historical observations are encoded as spatio-temporal context for decision-making and instruction following.

Architecturally, NaVid consists of a vision encoder, a query generator, cross-modality projectors, and an LLM. Each frame is represented by instruction-queried tokens plus instruction-agnostic tokens, and the resulting token sequence lets the LLM reason about the next navigation action. The model is trained on 510k navigation samples together with 763k web data, using a hybrid strategy that incorporates non-oracle navigation trajectories and auxiliary tasks to enhance performance.

NaVid achieves state-of-the-art performance in both simulation and real-world environments, outperforming existing methods in cross-dataset and Sim-to-Real transfer and demonstrating strong generalization. Its video-based formulation allows it to handle complex instructions across diverse indoor scenes, and it attains high success rates in real-world trials, showing robustness and adaptability. These results establish NaVid as an effective approach to navigating complex environments from free-form language instructions and a significant advance in VLN research.
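To make the token pipeline concrete, here is a minimal, hypothetical sketch of a NaVid-style forward pass in PyTorch. It is not the authors' implementation: the module names, dimensions, and the use of learnable queries and average pooling as stand-ins for the instruction-conditioned query generator and the instruction-agnostic grid tokens are all illustrative assumptions.

```python
# Hypothetical sketch of a NaVid-style agent (not the authors' code).
# Assumptions: a ViT-like patch encoder, a Q-Former-style query generator
# approximated by learnable queries with cross-attention, coarse pooling for
# instruction-agnostic tokens, linear cross-modality projectors, and a small
# transformer standing in for the LLM that decodes the next-step action.

import torch
import torch.nn as nn

class NaVidStyleAgent(nn.Module):
    def __init__(self, vis_dim=768, llm_dim=512, n_query=4, n_agnostic=4, vocab=32000):
        super().__init__()
        # Vision encoder stand-in: maps each RGB frame to a grid of patch features.
        self.vision_encoder = nn.Conv2d(3, vis_dim, kernel_size=16, stride=16)
        # Query generator stand-in: learnable queries cross-attend to patch features
        # (in the real model these tokens are conditioned on the instruction).
        self.queries = nn.Parameter(torch.randn(n_query, vis_dim))
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads=8, batch_first=True)
        self.n_agnostic = n_agnostic
        # Cross-modality projectors into the LLM embedding space.
        self.proj_queried = nn.Linear(vis_dim, llm_dim)
        self.proj_agnostic = nn.Linear(vis_dim, llm_dim)
        # LLM stand-in that reasons over visual context plus instruction tokens.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True), num_layers=2)
        self.action_head = nn.Linear(llm_dim, vocab)

    def encode_frame(self, frame):
        # frame: (3, H, W) -> patch grid features (1, num_patches, vis_dim)
        feats = self.vision_encoder(frame.unsqueeze(0)).flatten(2).transpose(1, 2)
        # Instruction-queried tokens: a few queries attend over the patch grid.
        q = self.queries.unsqueeze(0)
        queried, _ = self.cross_attn(q, feats, feats)
        # Instruction-agnostic tokens: coarse pooling of the same patch grid.
        agnostic = nn.functional.adaptive_avg_pool1d(
            feats.transpose(1, 2), self.n_agnostic).transpose(1, 2)
        return self.proj_queried(queried), self.proj_agnostic(agnostic)

    def forward(self, video, instruction_embeds):
        # video: (T, 3, H, W) RGB history; instruction_embeds: (L, llm_dim).
        frame_tokens = []
        for t in range(video.shape[0]):
            queried, agnostic = self.encode_frame(video[t])
            frame_tokens.append(torch.cat([queried, agnostic], dim=1))
        visual_ctx = torch.cat(frame_tokens, dim=1)            # whole history as tokens
        seq = torch.cat([visual_ctx, instruction_embeds.unsqueeze(0)], dim=1)
        hidden = self.llm(seq)
        # Decode the next-step action (e.g. "turn left 30 degrees") from the last position.
        return self.action_head(hidden[:, -1])

# Usage: an 8-frame RGB history at 224x224 plus a 16-token instruction embedding.
agent = NaVidStyleAgent()
logits = agent(torch.rand(8, 3, 224, 224), torch.randn(16, 512))
print(logits.shape)  # torch.Size([1, 32000])
```

The point of the sketch is the data flow the summary describes: every historical frame is compressed into a handful of tokens (instruction-queried plus instruction-agnostic), so the full RGB history fits in the language model's context and the action is decoded as text rather than read from maps, odometry, or depth.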