5 Apr 2018 | Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, Anton van den Hengel
Vision-and-Language Navigation (VLN) involves interpreting visually grounded navigation instructions in real environments. This paper introduces the Matterport3D Simulator, a large-scale reinforcement learning environment built from real imagery, and the Room-to-Room (R2R) dataset, the first benchmark for VLN in real, previously unseen, building-scale 3D environments. The R2R dataset contains 21,567 open-vocabulary navigation instructions with an average length of 29 words, each describing a trajectory that typically traverses multiple rooms. The simulator provides a realistic setting in which agents must navigate by following these natural language instructions, enabling systematic evaluation of VLN methods. The paper also presents a sequence-to-sequence neural network model for VLN, along with several baselines, and evaluates them on R2R. Results show that while the model performs well in previously seen environments, generalization to unseen environments remains challenging. The study highlights the importance of visual grounding in navigation and the need for further research in this area. The simulator and dataset are publicly available for future work.
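The sequence-to-sequence agent mentioned above pairs a recurrent instruction encoder with an attentive action decoder that consumes a visual observation at each step. The sketch below is a minimal illustration of that pattern in PyTorch, not the authors' released implementation: the module names, feature dimensions, and six-way discrete action space (forward, turn left/right, look up/down, stop) are assumptions chosen for brevity.

```python
# Minimal sketch of a sequence-to-sequence VLN agent (illustrative only).
# Assumptions: PyTorch, precomputed per-viewpoint image features, and a
# small discrete action space; dimensions and names are hypothetical.
import torch
import torch.nn as nn

class InstructionEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, seq_len) word indices of the instruction
        ctx, (h, c) = self.lstm(self.embed(tokens))
        return ctx, (h, c)  # per-word context vectors and final state

class ActionDecoder(nn.Module):
    def __init__(self, img_dim=2048, hidden_dim=512, num_actions=6):
        super().__init__()
        self.lstm_cell = nn.LSTMCell(img_dim + num_actions, hidden_dim)
        self.attn = nn.Linear(hidden_dim, hidden_dim)      # attention query
        self.policy = nn.Linear(hidden_dim * 2, num_actions)
        self.num_actions = num_actions

    def step(self, img_feat, prev_action, state, ctx):
        # img_feat: (batch, img_dim) current visual observation
        # prev_action: (batch,) index of the previous action taken
        prev_onehot = torch.nn.functional.one_hot(
            prev_action, self.num_actions).float()
        h, c = self.lstm_cell(torch.cat([img_feat, prev_onehot], dim=1), state)
        # Attend over the instruction with the decoder hidden state.
        scores = torch.bmm(ctx, self.attn(h).unsqueeze(2)).squeeze(2)
        weights = torch.softmax(scores, dim=1)
        attended = torch.bmm(weights.unsqueeze(1), ctx).squeeze(1)
        logits = self.policy(torch.cat([h, attended], dim=1))
        return logits, (h, c)

# Usage with random tensors standing in for real data:
enc = InstructionEncoder(vocab_size=1000)
dec = ActionDecoder()
tokens = torch.randint(1, 1000, (2, 29))         # two 29-word instructions
ctx, (h, c) = enc(tokens)
state = (h.squeeze(0), c.squeeze(0))
img = torch.randn(2, 2048)                       # e.g. CNN image features
prev = torch.zeros(2, dtype=torch.long)          # start action
logits, state = dec.step(img, prev, state, ctx)  # (2, 6) action scores
```

At inference time a loop would repeatedly pick the highest-scoring action, execute it in the simulator, and feed the new observation back into `dec.step` until the stop action is chosen.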