NaVid is a video-based large vision-language model (VLM) designed to address the generalization challenges of vision-and-language navigation (VLN). The paper introduces NaVid, which aims to achieve state-of-the-art navigation performance without relying on maps, odometry, or depth inputs. NaVid takes only the monocular RGB video stream from a robot's camera together with a human instruction and plans the next navigation step. The model encodes historical observations as spatio-temporal context and is trained on 510k navigation samples and 763k web-sourced samples. Extensive experiments show that NaVid outperforms existing methods in both simulated and real-world environments, demonstrating superior cross-dataset and Sim-to-Real transfer capabilities. The paper also analyzes the contribution of individual components and training strategies, highlighting the importance of co-tuning with navigation data and the use of special tokens for task-specific modeling.
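To make the input-output interface concrete, the minimal sketch below illustrates one plausible way such a video-conditioned model could be queried at each step: past RGB frames are accumulated as context, the instruction is supplied as text, and the model emits the next discrete action. All names here (`VideoNavVLM`, `encode_history`, `plan_next_action`, the action list) are illustrative assumptions, not NaVid's actual API.

```python
# Hypothetical step loop for a video-based VLN agent.
# Class and method names are assumptions for illustration, not NaVid's interface.
from dataclasses import dataclass, field
from typing import List

# A small discrete action space in the style of VLN-CE benchmarks.
ACTIONS = ["MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"]


@dataclass
class VideoNavVLM:
    """Stand-in for a model that maps (RGB history, instruction) -> next action."""
    history: List[str] = field(default_factory=list)  # encoded past observations

    def encode_history(self, frame_id: str) -> None:
        # In a real system this would compress each RGB frame into a few visual
        # tokens and append them to the spatio-temporal context.
        self.history.append(f"<frame:{frame_id}>")

    def plan_next_action(self, instruction: str) -> str:
        # A real model would run the LLM over [history tokens + instruction tokens]
        # and decode the next action. Here we simply stop after a few frames so
        # the sketch stays self-contained and runnable.
        return "STOP" if len(self.history) >= 3 else "MOVE_FORWARD"


if __name__ == "__main__":
    model = VideoNavVLM()
    instruction = "Walk past the sofa and stop at the kitchen door."
    for step, frame_id in enumerate(["rgb_000", "rgb_001", "rgb_002", "rgb_003"]):
        model.encode_history(frame_id)            # accumulate monocular RGB history
        action = model.plan_next_action(instruction)
        print(f"step {step}: {action}")
        if action == "STOP":
            break
```

The point of the sketch is the interface, not the internals: at every step the agent replans from the full observation history plus the instruction, which is the setting the summary describes (no maps, odometry, or depth).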