PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators

28 Jun 2024 | Kuo-Hao Zeng, Zichen Zhang, Kiana Ehsani, Rose Hendrix, Jordi Salvador, Alvaro Herrasti, Ross Girshick, Aniruddha Kembhavi, Luca Weihs
**Abstract:** PoliFormer (Policy Transformer) is an RGB-only indoor navigation agent trained end-to-end with reinforcement learning (RL) at scale, achieving state-of-the-art (SoTA) results in both simulated and real-world environments. The model pairs a foundational vision transformer encoder with a causal transformer decoder, enabling long-horizon memory and reasoning. It is trained on hundreds of millions of interactions across diverse environments, leveraging parallelized, multi-machine rollouts for efficient training. PoliFormer achieves an 85.5% success rate in object goal navigation on the CHORES-S benchmark, a 28.5% absolute improvement over previous work. It also extends, with no fine-tuning, to downstream applications such as object tracking, multi-object navigation, and open-vocabulary navigation.

**Introduction:** PoliFormer addresses the challenges of training deep RL agents for complex navigation tasks, particularly Object Goal Navigation (OGN). Unlike imitation learning (IL), which relies on expert demonstrations, on-policy RL allows the agent to explore the state space deeply. The model's architecture, training methodology, and environment interactions are all designed to scale, enabling SoTA results across multiple benchmarks.

**Method:** PoliFormer's architecture comprises a vision transformer encoder, a transformer state encoder, and a causal transformer decoder. The model is trained with on-policy RL in the AI2-THOR simulator, using parallel rollouts and large batch sizes to sustain high training throughput; the simulation environment itself is optimized for high-speed training. The model is evaluated on both simulated and real-world benchmarks.
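To make the method concrete, here is a minimal PyTorch sketch of how the three components might fit together in an actor-critic policy. The class name, layer counts, and dimensions are illustrative assumptions, not the released implementation; in particular, the visual tokens are assumed to come from a frozen foundational ViT backbone that is omitted here.

```python
# Minimal PyTorch sketch of a PoliFormer-style actor-critic policy.
# Layer counts, dimensions, and names are illustrative assumptions.
import torch
import torch.nn as nn


class PoliFormerPolicySketch(nn.Module):
    def __init__(self, num_actions: int = 20, d_model: int = 512):
        super().__init__()
        # (1) Visual patch tokens are assumed to come from a frozen
        #     foundational ViT applied to the RGB frame (not shown here).
        # (2) Transformer state encoder: fuses the patch tokens with a goal
        #     token into one state embedding for the current timestep.
        enc = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.state_encoder = nn.TransformerEncoder(enc, num_layers=3)
        self.state_token = nn.Parameter(torch.zeros(1, 1, d_model))
        # (3) Causal transformer decoder over the per-step state embeddings;
        #     a causally masked encoder stack is the equivalent decoder-only
        #     formulation and gives the agent explicit long-horizon memory.
        dec = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.causal_decoder = nn.TransformerEncoder(dec, num_layers=3)
        # Actor-critic heads for on-policy RL.
        self.actor = nn.Linear(d_model, num_actions)
        self.critic = nn.Linear(d_model, 1)

    def forward(self, patch_tokens, goal_token, history):
        """patch_tokens: (B, P, d) ViT tokens for the current RGB frame.
        goal_token: (B, 1, d) embedding of the target object category.
        history: (B, T, d) state embeddings cached from earlier steps."""
        b = patch_tokens.shape[0]
        tokens = torch.cat(
            [self.state_token.expand(b, -1, -1), goal_token, patch_tokens],
            dim=1,
        )
        state = self.state_encoder(tokens)[:, :1, :]  # per-step summary token
        seq = torch.cat([history, state], dim=1)      # append current step
        mask = nn.Transformer.generate_square_subsequent_mask(
            seq.shape[1]
        ).to(seq.device)
        memory = self.causal_decoder(seq, mask=mask)[:, -1, :]
        return self.actor(memory), self.critic(memory), state
```

Treating each timestep's state embedding as a token in a causally masked sequence is what gives the agent explicit long-horizon memory: the decoder can attend over the entire trajectory rather than compressing it into a fixed-size recurrent state.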
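The training recipe is easier to see in outline. The sketch below shows a simplified on-policy loop consistent with the description above: many simulator instances step in lockstep, and each clipped-surrogate (PPO-style) update consumes a fresh batch of N * rollout_len transitions. The vectorized `envs` API is a hypothetical placeholder (the authors use parallel, multi-machine AI2-THOR rollouts), the policy is abstracted to `obs -> (logits, value)`, and the update is simplified to a single epoch without GAE.

```python
# Schematic of a scaled on-policy training loop; not the authors' code.
import torch


def train(policy, envs, optimizer, num_updates=10_000, rollout_len=128,
          gamma=0.99, clip_eps=0.2):
    """envs: N parallel simulator instances stepping in lockstep (hypothetical
    API). envs.step returns batched (obs, reward, done); `done` is bool."""
    obs = envs.reset()
    for _ in range(num_updates):
        traj = []
        # 1) Collect fresh on-policy experience from all N environments.
        for _ in range(rollout_len):
            with torch.no_grad():
                logits, value = policy(obs)
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            next_obs, reward, done = envs.step(action)
            traj.append((obs, action, dist.log_prob(action),
                         value.squeeze(-1), reward, done))
            obs = next_obs
        # 2) Discounted returns, computed backwards through the rollout.
        ret = torch.zeros_like(traj[-1][4])
        returns = []
        for _, _, _, _, reward, done in reversed(traj):
            ret = reward + gamma * ret * (~done)
            returns.append(ret)
        returns.reverse()
        # 3) One clipped-surrogate update over the whole N * rollout_len batch
        #    (real PPO runs several epochs of minibatches and uses GAE).
        loss = 0.0
        for (o, act, logp_old, val, _, _), ret in zip(traj, returns):
            logits, value = policy(o)
            logp = torch.distributions.Categorical(logits=logits).log_prob(act)
            adv = (ret - val).detach()
            ratio = (logp - logp_old).exp()
            clipped = ratio.clamp(1.0 - clip_eps, 1.0 + clip_eps) * adv
            loss = (loss - torch.min(ratio * adv, clipped).mean()
                    + 0.5 * (value.squeeze(-1) - ret).pow(2).mean())
        optimizer.zero_grad()
        (loss / rollout_len).backward()
        optimizer.step()
```

Because every update consumes only freshly collected data, the agent explores the state space on-policy rather than imitating fixed expert demonstrations, which is the scaling property the paper emphasizes.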
**Results:** PoliFormer achieves SoTA performance on four simulation benchmarks and two real-world benchmarks, across two embodiments (LoCoBot and Stretch RE-1). It outperforms previous work and generalizes from simulation to the real world; ablation studies and real-world experiments further validate its design choices.

**Discussion:** PoliFormer's results suggest that further scaling of model parameters and training time could yield additional gains. Limitations include the lack of a depth sensor and a discretized action space that could be made more realistic. Future work will explore these directions and the potential of cross-embodiment training.