GOAT-Bench is a benchmark for multi-modal lifelong navigation, designed to evaluate agents that navigate to a sequence of goals specified through category names, language descriptions, or images. The benchmark comprises 181 HM3DSem scenes, 312 object categories, and 680k episodes. It features open-vocabulary, multi-modal goals in a lifelong setting: each episode consists of 5-10 goals, each specified through one of the different modalities.

Two families of methods are compared: modular learning methods and end-to-end SenseAct-NN policies trained with and without memory. Modular methods, which build semantic maps, perform better in terms of efficiency (SPL) and robustness to noise in goal specifications, while SenseAct-NN methods achieve higher success rates but are less efficient. These results highlight the importance of effective memory representations for navigation efficiency. The benchmark also evaluates performance across goal modalities and robustness to noisy goal specifications, providing a comprehensive analysis of how current multi-modal lifelong navigation methods handle the various goal types.
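For context, SPL (Success weighted by Path Length) is the standard efficiency metric in embodied navigation. The summary above does not define it; assuming the usual formulation from Anderson et al. (2018), it is

$$
\mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{\ell_i}{\max(p_i, \ell_i)},
$$

where $S_i \in \{0, 1\}$ indicates success on goal $i$, $\ell_i$ is the shortest-path distance from the agent's starting position to that goal, and $p_i$ is the length of the path the agent actually traversed. A higher SPL means the agent not only reached its goals but did so along near-optimal paths, which is why it is used here to quantify efficiency.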