Counterfactual Multi-Agent Policy Gradients

14 Dec 2017 | Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, Shimon Whiteson
This paper introduces COMA (Counterfactual Multi-Agent) policy gradients, a multi-agent actor-critic method that addresses the credit assignment problem that arises when learning decentralized policies. COMA trains a single centralized critic to estimate the joint-action Q-function, while decentralized actors optimize each agent's individual policy.

To assign credit, COMA draws on the idea of difference rewards, in which each agent learns from the difference between the global reward and the reward that would have been obtained had its action been replaced with a default action. Difference rewards, however, typically require extra simulations or an arbitrary choice of default action. COMA avoids both: its counterfactual baseline uses the centralized critic to marginalize out a single agent's action under that agent's own policy while keeping the other agents' actions fixed. Because the critic's representation outputs Q-values for all of one agent's actions at once, the baseline, and hence the counterfactual advantage, can be computed in a single forward pass directly from the agents' experiences. A minimal sketch of this computation is given below.
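To make the baseline concrete: for agent a, the counterfactual advantage is A^a(s, u) = Q(s, u) - sum over u'^a of pi^a(u'^a | tau^a) * Q(s, (u^{-a}, u'^a)), i.e. the joint-action value minus its expectation over agent a's own actions, with the other agents' actions held fixed. The sketch below is a minimal NumPy illustration of this computation, not the authors' implementation; the array names and shapes are assumptions made here for clarity, and the critic is assumed to output, for each agent, the Q-values of all of that agent's actions with the other agents' actions fixed to those actually taken.

```python
import numpy as np

def coma_advantage(q_values, policy_probs, taken_actions):
    """Counterfactual advantage A^a(s, u) -- illustrative sketch, not the paper's code.

    q_values:      (batch, n_agents, n_actions); entry [b, a, u] is the centralized
                   critic's Q(s, (u^{-a}, u)), i.e. the joint-action value with the
                   other agents' actions fixed to those actually taken.
    policy_probs:  (batch, n_agents, n_actions); agent a's policy pi^a(u | tau^a).
    taken_actions: (batch, n_agents) integer actions actually executed.
    """
    # Q(s, u) for the joint action actually executed.
    q_taken = np.take_along_axis(q_values, taken_actions[..., None], axis=2).squeeze(-1)
    # Counterfactual baseline: expectation of Q over agent a's own actions under its
    # current policy, with the other agents' actions held fixed. No extra simulations
    # are needed because the critic already scores every alternative action of agent a.
    baseline = (policy_probs * q_values).sum(axis=2)
    # Advantage used in each agent's policy-gradient update.
    return q_taken - baseline
```

Each actor is then updated along the gradient of log pi^a(u^a | tau^a) weighted by this advantage, as in a standard policy gradient but with the counterfactual advantage in place of a per-agent return.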
COMA is evaluated on the StarCraft unit micromanagement benchmark in a decentralized setting with a restricted field of view and no access to macro-actions. In this setting, COMA improves both training speed and final performance over other multi-agent actor-critic methods, is the best performing and most consistent of the evaluated approaches, and is competitive with state-of-the-art centralized controllers. These results suggest that COMA is a promising approach for multi-agent reinforcement learning with decentralized policies.