6 Jun 2018 | Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, Shimon Whiteson
QMIX is a value-based method for deep multi-agent reinforcement learning that trains decentralized policies in a centralized, end-to-end manner. A mixing network estimates the joint action-value as a complex non-linear combination of per-agent values, each of which conditions only on that agent's local observations. The key structural idea is to enforce that the joint action-value is monotonic in every agent's individual value; this keeps maximization of the joint action-value tractable in off-policy learning and guarantees consistency between the centralized and decentralized policies, since per-agent greedy action selection yields the same joint action as greedy selection over the joint action-value.

Architecturally, QMIX combines agent networks, which represent the individual action-value functions, with a mixing network that merges them into the joint action-value. Monotonicity is enforced by constraining the mixing network's weights to be non-negative, and because those weights are generated as a function of the global state, QMIX can exploit extra state information that is available during centralized training but not to the decentralized agents. The whole system is trained end-to-end to minimize a TD loss between the predicted and target joint action-values.
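To make the monotonic mixing concrete, below is a minimal sketch of a QMIX-style mixing network in PyTorch. It assumes the mixing weights come from simple state-conditioned hypernetworks and are made non-negative with an absolute value; the layer sizes, the single hidden layer, and the ELU activation are illustrative choices rather than the exact configuration reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    """Sketch of a QMIX-style mixing network.

    Combines per-agent Q-values into Q_tot with weights forced to be
    non-negative, so dQ_tot/dQ_a >= 0 for every agent a.  Dimensions
    and hypernetwork depths are illustrative assumptions.
    """

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        # Hypernetworks: the global state produces the mixing weights and biases.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        batch = agent_qs.size(0)
        # abs() keeps the mixing weights non-negative -> monotonic mixing.
        w1 = torch.abs(self.hyper_w1(state)).view(batch, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(batch, 1, self.embed_dim)
        w2 = torch.abs(self.hyper_w2(state)).view(batch, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(batch, 1, 1)

        hidden = F.elu(torch.bmm(agent_qs.view(batch, 1, self.n_agents), w1) + b1)
        q_tot = torch.bmm(hidden, w2) + b2   # (batch, 1, 1)
        return q_tot.view(batch, 1)
```

Because every weight multiplying an agent's Q-value is non-negative, increasing any individual Q_a can never decrease Q_tot, which is exactly the monotonicity property that makes per-agent argmax consistent with the joint argmax.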
QMIX is evaluated on a two-step cooperative matrix game, which demonstrates that it can represent a richer class of joint action-value functions than the linear decomposition of VDN, and on a range of challenging StarCraft II unit micromanagement scenarios, where it outperforms existing value-based multi-agent reinforcement learning methods in both final performance and learning speed. These results indicate that non-linear mixing and conditioning on extra state information allow QMIX to learn effective decentralized policies under centralized training and to achieve consistent performance across complex multi-agent tasks.
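To complement the mixing-network sketch, the end-to-end training objective described above can be written as a standard TD loss over the joint action-value. The batch layout and the names agent_nets, mixer, and their target copies are illustrative assumptions; decentralized greedy maximization in the target is valid precisely because the mixing is monotonic.

```python
import torch

def qmix_td_loss(agent_nets, mixer, target_agent_nets, target_mixer, batch, gamma=0.99):
    """Sketch of the end-to-end TD objective.

    `batch` is assumed to hold tensors: obs (B, n_agents, obs_dim),
    actions (B, n_agents, 1), rewards (B, 1), dones (B, 1),
    states / next_states (B, state_dim) and next_obs.  Names are illustrative.
    """
    # Per-agent Q-values for the actions actually taken.
    qs = agent_nets(batch["obs"])                                    # (B, n_agents, n_actions)
    chosen_qs = qs.gather(dim=2, index=batch["actions"]).squeeze(2)  # (B, n_agents)
    q_tot = mixer(chosen_qs, batch["states"])                        # (B, 1)

    with torch.no_grad():
        # Decentralized greedy maximization is consistent with the joint argmax
        # because the mixing network is monotonic in each agent's Q-value.
        next_qs = target_agent_nets(batch["next_obs"]).max(dim=2).values  # (B, n_agents)
        target_q_tot = target_mixer(next_qs, batch["next_states"])        # (B, 1)
        y = batch["rewards"] + gamma * (1 - batch["dones"]) * target_q_tot

    return torch.mean((q_tot - y) ** 2)
```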