The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games


4 Nov 2022 | Chao Yu*, Akash Velu*, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, Yi Wu
This paper investigates the effectiveness of Proximal Policy Optimization (PPO) in cooperative multi-agent settings. PPO is a popular on-policy reinforcement learning algorithm, but it has seen less use in multi-agent systems due to the belief that it is less sample-efficient than off-policy methods. The authors demonstrate that PPO-based multi-agent algorithms achieve strong performance in four popular cooperative multi-agent testbeds: the particle-world environments, the StarCraft multi-agent challenge, Google Research Football, and the Hanabi challenge. These results are obtained with minimal hyperparameter tuning and without any domain-specific algorithmic modifications or architectures. Importantly, PPO often matches or exceeds competitive off-policy methods in both final returns and sample efficiency. Through ablation studies, the authors analyze the implementation and hyperparameter factors that are critical to PPO's empirical performance and provide concrete practical suggestions regarding these factors. Their results show that simple PPO-based methods can be a strong baseline in cooperative multi-agent reinforcement learning. The source code is released at https://github.com/marlbenchmark/on-policy.
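At the core of the PPO-based multi-agent methods discussed above is PPO's clipped surrogate objective, optimized for each agent's policy with advantage estimates (in the centralized-critic variant, these come from a shared value function conditioned on global state). The sketch below illustrates that clipped objective in PyTorch; the function name, tensor shapes, and usage are illustrative assumptions and not the authors' released implementation.

```python
# Minimal sketch (illustrative, not the paper's code) of the PPO clipped
# surrogate loss that a PPO-based multi-agent method applies to each agent's
# policy, given advantage estimates from a (possibly centralized) critic.
import torch


def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate policy loss (Schulman et al., 2017).

    new_log_probs: log pi_theta(a|o) under the current policy parameters
    old_log_probs: log pi_theta_old(a|o) recorded when the data was collected
    advantages:    advantage estimates, e.g. from GAE on the critic's values
    """
    ratio = torch.exp(new_log_probs - old_log_probs)  # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the surrogate, so return its negation for a minimizer.
    return -torch.min(unclipped, clipped).mean()


if __name__ == "__main__":
    # Dummy batch of 32 agent transitions, purely for demonstration.
    new_lp = torch.randn(32, requires_grad=True)
    old_lp = new_lp.detach() + 0.1 * torch.randn(32)
    adv = torch.randn(32)
    loss = ppo_clipped_loss(new_lp, old_lp, adv)
    loss.backward()
    print(loss.item())
```

The clipping term is what lets PPO reuse each batch for several gradient epochs without the policy drifting too far from the data-collecting policy, which is one reason the paper finds on-policy PPO more sample-efficient in these benchmarks than is commonly assumed.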