Understanding Vision-and-Language Navigation via Causal Learning

This paper introduces GOAT, a novel approach to vision-and-language navigation (VLN) that addresses dataset bias through causal learning. The method employs back-door and front-door adjustment causal learning (BACL and FACL) to mitigate both observable and unobservable confounders in vision, language, and history. A cross-modal feature pooling (CFP) module, supervised by contrastive learning, aggregates long sequential features and improves cross-modal representations. The approach is evaluated on multiple VLN datasets (R2R, REVERIE, RxR, and SOON), where it outperforms existing state-of-the-art methods. The results show that GOAT achieves significant improvements in navigation accuracy, instruction-following, and object grounding across both seen and unseen environments. The causal learning pipeline is designed to enhance the generalization capabilities of VLN agents by enabling unbiased feature learning and decision-making. Extensive experiments validate the method's effectiveness and highlight its robustness across diverse scenarios.
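
For readers unfamiliar with back-door adjustment, the following is a minimal sketch of the textbook formula P(Y | do(X=x)) = sum over z of P(Y | X=x, Z=z) P(Z=z) on a toy discrete example. It is not GOAT's BACL module: the variable names, the binary confounder Z, and all probabilities are made-up assumptions chosen only to show how adjusting for an observable confounder changes the estimated effect compared with the raw observational estimate.

```python
# Toy illustration of back-door adjustment (hypothetical numbers, not from the paper).
# Z: observable confounder, X: binary "treatment", Y: binary outcome.

P_Z = {0: 0.5, 1: 0.5}                       # P(Z=z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}              # P(X=1 | Z=z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (0, 1): 0.5,   # P(Y=1 | X=x, Z=z), keyed by (x, z)
                 (1, 0): 0.3, (1, 1): 0.7}


def backdoor_adjusted(x):
    """P(Y=1 | do(X=x)): average over the marginal P(Z), not P(Z | X)."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * pz for z, pz in P_Z.items())


def observational(x):
    """P(Y=1 | X=x): conditions on X, so Z is weighted by P(Z | X=x)."""
    # Joint P(X=x, Z=z) for each z.
    joint = {
        z: pz * (P_X1_GIVEN_Z[z] if x == 1 else 1.0 - P_X1_GIVEN_Z[z])
        for z, pz in P_Z.items()
    }
    p_x = sum(joint.values())
    return sum(P_Y1_GIVEN_XZ[(x, z)] * joint[z] / p_x for z in P_Z)


# Difference in outcome probability between X=1 and X=0 under each estimator.
causal_effect = backdoor_adjusted(1) - backdoor_adjusted(0)
naive_effect = observational(1) - observational(0)

if __name__ == "__main__":
    print(f"adjusted effect: {causal_effect:.2f}")
    print(f"naive effect:    {naive_effect:.2f}")
```

In this toy setting the naive estimate is larger than the adjusted one because Z raises both the chance of X=1 and the chance of Y=1. Front-door adjustment, which targets unobservable confounders, instead routes the estimate through a mediator and is not shown here.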

Vision-and-Language Navigation via Causal Learning

16 Apr 2024 | LiuYi Wang, Zongtao He, Ronghao Dang, Mengjiao Shen, Chengju Liu, Qijun Chen