This paper introduces GOAT, a novel approach for vision-and-language navigation (VLN) that addresses dataset bias through causal learning. The method employs back-door and front-door adjustment causal learning (BACL and FACL) modules to mitigate observable and unobservable confounders, respectively. Additionally, a cross-modal feature pooling (CFP) module is introduced to enhance cross-modal representations and improve generalization. The proposed framework is validated across multiple VLN datasets (R2R, REVERIE, RxR, and SOON), demonstrating superior performance compared to existing state-of-the-art methods. The causal learning pipeline enables unbiased feature learning and decision-making by integrating causal inference principles into the model. The experiments show that GOAT significantly improves navigation accuracy and generalization capabilities, highlighting the effectiveness of causal learning in enhancing VLN systems. The method also provides a systematic approach for extracting coherent representations from sequence inputs, offering valuable insights for future research in similar tasks.This paper introduces GOAT, a novel approach for vision-and-language navigation (VLN) that addresses dataset bias through causal learning. The method employs back-door and front-door adjustment causal learning (BACL and FACL) modules to mitigate observable and unobservable confounders, respectively. Additionally, a cross-modal feature pooling (CFP) module is introduced to enhance cross-modal representations and improve generalization. The proposed framework is validated across multiple VLN datasets (R2R, REVERIE, RxR, and SOON), demonstrating superior performance compared to existing state-of-the-art methods. The causal learning pipeline enables unbiased feature learning and decision-making by integrating causal inference principles into the model. The experiments show that GOAT significantly improves navigation accuracy and generalization capabilities, highlighting the effectiveness of causal learning in enhancing VLN systems. The method also provides a systematic approach for extracting coherent representations from sequence inputs, offering valuable insights for future research in similar tasks.