Counterfactual Multi-Agent Policy Gradients

14 Dec 2017 | Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, Shimon Whiteson
The paper introduces a new multi-agent reinforcement learning (RL) method called *Counterfactual Multi-Agent (COMA)* policy gradients. COMA addresses the challenges of decentralized policy learning and multi-agent credit assignment in cooperative multi-agent systems. It uses a centralized critic to estimate the *Q*-function and decentralized actors to optimize each agent's policy. A key innovation is the *counterfactual baseline*, which marginalizes out a single agent's action while keeping the other agents' actions fixed, allowing for efficient computation of the advantage function. COMA also employs a critic representation that enables the counterfactual baseline to be computed in a single forward pass. The method is evaluated in the *StarCraft unit micromanagement* benchmark, showing significant improvements over other multi-agent actor-critic methods and competitive performance with state-of-the-art centralized controllers. The paper discusses related work, provides background on multi-agent RL, and details the experimental setup and results.
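To make the counterfactual baseline concrete, here is a minimal sketch (not the authors' code) of how the COMA advantage can be computed once the centralized critic has produced, in a single forward pass, the Q-values for every alternative action of each agent with the other agents' actions held fixed. The array shapes and the function name `coma_advantage` are illustrative assumptions.

```python
import numpy as np

def coma_advantage(q_values, policy_probs, chosen_actions):
    """Sketch of COMA's counterfactual advantage for each agent.

    q_values:       array (n_agents, n_actions); Q(s, (u^-a, u'^a)) for every
                    alternative action u'^a of agent a, other actions fixed.
    policy_probs:   array (n_agents, n_actions); pi^a(u'^a | tau^a).
    chosen_actions: array (n_agents,); the joint action actually taken.
    """
    n_agents = q_values.shape[0]
    # Q-value of the joint action that was actually executed.
    q_taken = q_values[np.arange(n_agents), chosen_actions]
    # Counterfactual baseline: marginalize out agent a's own action under its
    # current policy while keeping the other agents' actions fixed.
    baseline = np.sum(policy_probs * q_values, axis=1)
    # Advantage used by each decentralized actor's policy-gradient update.
    return q_taken - baseline
```

Because the baseline depends only on the agent's own policy and not on its sampled action, subtracting it does not bias the policy gradient, while the per-agent comparison against "what if this agent had acted differently" addresses the credit assignment problem.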
Understanding Counterfactual Multi-Agent Policy Gradients