NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

30 Jun 2024 | Jiazhao Zhang,1,2,* Kunyu Wang,2,* Rongtao Xu,2,3,* Gengze Zhou,4 Yicong Hong,5 Xiaomeng Fang,2 Qi Wu,4 Zhizheng Zhang,6,† He Wang,1,2,6,†
NaVid is a video-based large vision-language model (VLM) designed to address the generalization challenges of vision-and-language navigation (VLN). It aims to achieve state-of-the-art navigation performance without relying on maps, odometry, or depth inputs: given only the monocular RGB video stream from a robot's camera and a human instruction, NaVid plans the next step of the navigation. The model encodes historical observations as spatio-temporal context and is trained on 510k navigation samples together with 763k web data samples. Extensive experiments show that NaVid outperforms existing methods in both simulated and real-world environments, demonstrating strong cross-dataset and sim-to-real transfer. The paper also analyzes the effectiveness of individual components and training strategies, highlighting the importance of co-tuning with navigation data and of special tokens for task-specific modeling.
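To make the described pipeline concrete, below is a minimal sketch of the inference loop implied by the abstract: accumulate the monocular RGB frame history as spatio-temporal context and query a model for the next discrete action. The `NavidStylePlanner` class, its placeholder encoder and policy head, and the action set are hypothetical stand-ins for illustration, not the authors' released implementation.

```python
# Minimal sketch of NaVid-style next-step planning from a monocular RGB
# video stream. The encoder and policy here are toy placeholders; the real
# NaVid decodes actions with a pretrained video-based VLM.
from collections import deque
from dataclasses import dataclass
from typing import Deque

import numpy as np

# A typical low-level VLN-CE action set (assumption for this sketch).
ACTIONS = ["FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"]


@dataclass
class Step:
    action: str
    frames_seen: int


class NavidStylePlanner:
    """Keeps the RGB frame history as spatio-temporal context, no maps/depth."""

    def __init__(self, instruction: str, max_history: int = 64):
        self.instruction = instruction
        # Historical observations: a bounded buffer of raw RGB frames.
        self.history: Deque[np.ndarray] = deque(maxlen=max_history)

    def encode_frame(self, frame: np.ndarray) -> np.ndarray:
        # Placeholder visual encoder: mean-pool pixels into a tiny feature.
        # NaVid uses a pretrained vision encoder; this is only a stand-in.
        return frame.reshape(-1, frame.shape[-1]).mean(axis=0)

    def plan_next_step(self, frame: np.ndarray) -> Step:
        self.history.append(frame)
        # Encode every historical frame to form the spatio-temporal context.
        context = np.stack([self.encode_frame(f) for f in self.history])
        # Placeholder policy head: a real VLM would decode an action token
        # conditioned on the instruction text and the visual context.
        score = int(context.mean()) + len(self.instruction)
        action = ACTIONS[score % len(ACTIONS)]
        return Step(action=action, frames_seen=len(self.history))


if __name__ == "__main__":
    planner = NavidStylePlanner("walk past the sofa and stop at the door")
    for _ in range(3):
        rgb = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
        print(planner.plan_next_step(rgb))
```

The design point this sketch mirrors is that the policy's only inputs are the instruction and the accumulated RGB frames, which is what makes the approach map-, odometry-, and depth-free.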