5 Apr 2018 | Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, Anton van den Hengel
The paper introduces the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery, and presents the Room-to-Room (R2R) dataset, the first benchmark for visually-grounded natural language navigation, i.e. vision-and-language navigation (VLN), in real buildings. The R2R dataset contains 21,567 open-vocabulary, crowd-sourced navigation instructions with an average length of 29 words, and is designed to simplify the application of vision and language methods to the problem of interpreting visually-grounded navigation instructions. The simulator, constructed from the Matterport3D dataset of real panoramic imagery, offers a rich and diverse visual environment that preserves the visual complexity of real buildings and raises the prospect of transferring trained agents to real-world applications. The paper also defines an evaluation protocol, including the definition of navigation error and the success criterion, and presents several baselines along with a sequence-to-sequence neural network model for the R2R task. The results show that while existing vision and language methods can be applied successfully, generalizing to previously unseen environments remains a significant challenge. The paper concludes by highlighting the importance of VLN for practical robotics and the potential of the simulator and dataset to support further research in related areas.
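
As a point of reference for the evaluation protocol mentioned above: the paper measures navigation error as the shortest-path distance from the agent's final position to the goal, and counts an episode as a success when that error is below 3 m. The following is a minimal sketch of how a success rate could be computed from per-episode navigation errors; the function and variable names are illustrative and not taken from the paper's released code.

```python
from typing import Sequence

# Success threshold used in the R2R evaluation protocol (meters).
SUCCESS_THRESHOLD_M = 3.0

def success_rate(navigation_errors_m: Sequence[float],
                 threshold_m: float = SUCCESS_THRESHOLD_M) -> float:
    """Fraction of episodes whose navigation error falls below the threshold.

    Each entry in `navigation_errors_m` is assumed to be the shortest-path
    distance (in meters) from the agent's final position to the goal,
    computed over the environment's navigation graph.
    """
    if not navigation_errors_m:
        raise ValueError("No episodes provided.")
    successes = sum(1 for err in navigation_errors_m if err < threshold_m)
    return successes / len(navigation_errors_m)

# Example: three episodes with hypothetical navigation errors of 1.2 m, 4.5 m, 2.9 m.
print(success_rate([1.2, 4.5, 2.9]))  # 0.666...
```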
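
The sequence-to-sequence baseline mentioned above maps a natural-language instruction to a sequence of navigation actions. The sketch below shows the general shape of such a model in PyTorch: an LSTM encoder over instruction word embeddings and an LSTM decoder conditioned on per-step visual features and the previous action, predicting the next action. All layer sizes and names are illustrative assumptions, and the paper's attention mechanism over the instruction encoding is omitted for brevity; this is not the authors' released implementation.

```python
import torch
import torch.nn as nn

class InstructionToActionSeq2Seq(nn.Module):
    """Minimal instruction-following seq2seq sketch (illustrative, not the paper's code).

    Encoder: LSTM over instruction word embeddings.
    Decoder: LSTM over [image feature, previous-action embedding], predicting
    one of a small set of discrete navigation actions at each step.
    """

    def __init__(self, vocab_size: int, num_actions: int = 6,
                 embed_dim: int = 256, img_feat_dim: int = 2048,
                 hidden_dim: int = 512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.action_embed = nn.Embedding(num_actions, embed_dim)
        self.decoder = nn.LSTM(img_feat_dim + embed_dim, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, instr_tokens, img_feats, prev_actions):
        # instr_tokens: (batch, instr_len) word ids
        # img_feats:    (batch, traj_len, img_feat_dim) visual features per step
        # prev_actions: (batch, traj_len) previous-action ids (teacher forcing)
        _, (h, c) = self.encoder(self.word_embed(instr_tokens))
        dec_in = torch.cat([img_feats, self.action_embed(prev_actions)], dim=-1)
        dec_out, _ = self.decoder(dec_in, (h, c))  # decoder initialized from encoder state
        return self.action_head(dec_out)           # (batch, traj_len, num_actions)

# Example with hypothetical sizes: batch of 2, 10-word instructions, 5-step trajectories.
model = InstructionToActionSeq2Seq(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 10)),
               torch.randn(2, 5, 2048),
               torch.randint(0, 6, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 6])
```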