21 Feb 2020 | Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, David Silver
MuZero is a model-based reinforcement learning algorithm that combines tree-based search with a learned model to achieve superhuman performance in a range of challenging domains, including Atari, Go, chess, and shogi. Unlike traditional model-based methods, which require knowledge of the environment's dynamics, MuZero learns a model that predicts only the aspects of the future directly relevant for planning: the reward, the action-selection policy, and the value function. This lets it plan without any prior knowledge of the game rules, using the learned model to simulate future states inside a search process similar to AlphaZero's.

The model is trained end-to-end to accurately predict these three quantities, using a recurrent process that updates a hidden state iteratively as hypothetical actions are applied. In this way MuZero also extends AlphaZero to a broader set of environments, including single-agent domains and environments with non-zero rewards at intermediate time steps.

Evaluated across 57 Atari games and three board games, MuZero achieves new state-of-the-art results in visually complex domains like Atari, where previous model-based approaches struggled, and matches AlphaZero's performance in Go, chess, and shogi. This success demonstrates the potential of model-based reinforcement learning in real-world applications where the environment's dynamics are unknown.
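The learned model described above can be pictured as three functions wired together: a representation function that encodes the observation into a hidden state, a dynamics function that updates the hidden state and predicts the reward for a candidate action, and a prediction function that outputs the policy and value. The sketch below shows that recurrent structure in PyTorch, assuming small fully connected networks and one-hot action encodings; the class names (`Representation`, `Dynamics`, `Prediction`) and the `unroll` helper are illustrative, not the authors' implementation.

```python
# Minimal, illustrative sketch of MuZero's three learned functions.
# Assumes flat observations, one-hot actions, and tiny MLPs; the paper
# uses much larger (residual) networks, but the recurrent structure is the same.
import torch
import torch.nn as nn

class Representation(nn.Module):
    """h: observation -> initial hidden state s_0."""
    def __init__(self, obs_dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))

    def forward(self, obs):
        return self.net(obs)

class Dynamics(nn.Module):
    """g: (hidden state, action) -> (next hidden state, predicted reward)."""
    def __init__(self, hidden_dim, action_dim):
        super().__init__()
        self.state_net = nn.Sequential(
            nn.Linear(hidden_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))
        self.reward_head = nn.Linear(hidden_dim, 1)

    def forward(self, state, action_onehot):
        x = torch.cat([state, action_onehot], dim=-1)
        next_state = self.state_net(x)
        return next_state, self.reward_head(next_state)

class Prediction(nn.Module):
    """f: hidden state -> (policy logits, value)."""
    def __init__(self, hidden_dim, action_dim):
        super().__init__()
        self.policy_head = nn.Linear(hidden_dim, action_dim)
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, state):
        return self.policy_head(state), self.value_head(state)

def unroll(repr_net, dyn_net, pred_net, obs, actions):
    """Recurrently unroll the learned model along a sequence of actions,
    collecting the policy, value and reward predictions used for planning."""
    state = repr_net(obs)                  # s_0 from the representation function
    policy, value = pred_net(state)        # predictions at the root
    outputs = [(policy, value, None)]      # no reward is predicted for the root
    for a in actions:                      # each `a` is a one-hot action tensor
        state, reward = dyn_net(state, a)  # next hidden state and reward
        policy, value = pred_net(state)    # policy and value at the new state
        outputs.append((policy, value, reward))
    return outputs
```

During training, sequences of predictions produced by `unroll` would be matched against the observed rewards and the policy and value targets generated by the search, which is what ties the hidden state to the three quantities needed for planning rather than to a reconstruction of the observation.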