David Silver*, Julian Schrittwieser*, Karen Simonyan*, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, Demis Hassabis.
AlphaGo Zero, developed by DeepMind, is a reinforcement learning algorithm that achieves superhuman performance in the game of Go without any human data, guidance, or domain knowledge beyond the game rules. It learns entirely through self-play, starting from random moves, and improves its strategy with a single neural network that both predicts move probabilities and evaluates positions. This network is trained by a novel reinforcement learning algorithm that runs Monte Carlo tree search (MCTS) inside the training loop, which yields rapid improvement and stable learning. AlphaGo Zero outperforms previous versions of AlphaGo, which were trained on human data: it defeated AlphaGo Lee, the version that beat Lee Sedol, by 100 games to 0, and won a 100-game match against the stronger AlphaGo Master by 89 games to 11.
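To make the search step concrete, here is a minimal sketch of how an MCTS of this kind can choose which move to explore from a node, assuming the PUCT-style rule described in the AlphaGo Zero paper: pick the move maximizing Q(s, a) + U(s, a), where the exploration bonus U grows with the network prior P(s, a) and shrinks as a move's visit count N(s, a) rises. The names `EdgeStats` and `select_action`, and the value `c_puct = 1.5`, are hypothetical choices for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class EdgeStats:
    """Statistics stored on one edge (s, a) of the search tree (hypothetical structure)."""
    prior: float              # P(s, a): network prior for move a
    visits: int = 0           # N(s, a): how often the search took this edge
    total_value: float = 0.0  # W(s, a): sum of values backed up through this edge

    @property
    def mean_value(self) -> float:
        # Q(s, a) = W(s, a) / N(s, a); unvisited edges default to 0
        return self.total_value / self.visits if self.visits else 0.0


def select_action(edges: dict[str, EdgeStats], c_puct: float = 1.5) -> str:
    """Return argmax_a Q(s, a) + U(s, a), the PUCT selection rule.

    U(s, a) = c_puct * P(s, a) * sqrt(sum_b N(s, b)) / (1 + N(s, a)).
    c_puct = 1.5 is an illustrative assumption, not a value from the paper.
    """
    sqrt_total = math.sqrt(sum(e.visits for e in edges.values()))

    def puct_score(move: str) -> float:
        e = edges[move]
        exploration = c_puct * e.prior * sqrt_total / (1 + e.visits)
        return e.mean_value + exploration

    return max(edges, key=puct_score)


edges = {
    "a": EdgeStats(prior=0.6),
    "b": EdgeStats(prior=0.3, visits=10, total_value=6.0),
    "c": EdgeStats(prior=0.1),
}
print(select_action(edges))  # prints "a": its high prior outweighs b's better Q
```

Here move "b" already has a good average value (0.6), but the unvisited, high-prior move "a" receives a much larger exploration bonus, so the search tries it next. When no edge at a node has been visited yet, every bonus is zero and `max` simply returns the first move, so a practical implementation would want a better tie-break.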
The algorithm uses a single deep neural network that combines the roles of the separate policy and value networks used in earlier versions. Given a board position, the network outputs move probabilities and a scalar value estimating the current player's chance of winning. During self-play, each move is chosen by an MCTS guided by the network, and the search's move probabilities are generally stronger than the network's raw output. The network parameters are then updated, using positions from recent self-play games, so that the predicted move probabilities match the search probabilities and the predicted value matches the game's winner. As this loop repeats, the improved network guides a stronger search in the next iteration, so MCTS acts as a policy-improvement step that lets the algorithm explore the vast search space of Go efficiently.
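The training objective can be sketched for a single position. Following the loss given in the AlphaGo Zero paper, l = (z − v)² − πᵀ log p + c‖θ‖², the network's value v is pushed toward the game outcome z, its move probabilities p toward the search probabilities π, and an L2 penalty keeps the weights θ small. In this sketch π is derived from MCTS visit counts as π(a|s) ∝ N(s, a)^(1/τ) with temperature τ. The function names, the raw-logit input format, and the default `c = 1e-4` are assumptions made for illustration; a real implementation would batch this in a tensor library.

```python
import math


def search_probabilities(visit_counts: list[int], temperature: float = 1.0) -> list[float]:
    """pi(a|s) proportional to N(s, a) ** (1 / temperature); temperature must be > 0."""
    powered = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(powered)
    return [x / total for x in powered]


def alphago_zero_loss(policy_logits, value, pi, z, weights, c=1e-4):
    """l = (z - v)^2 - pi^T log p + c * ||theta||^2 for one position (hypothetical helper)."""
    # Numerically stable log-softmax turns raw logits into log p
    m = max(policy_logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in policy_logits))
    log_p = [x - log_norm for x in policy_logits]

    value_loss = (z - value) ** 2                           # squared error on the outcome
    policy_loss = -sum(t * lp for t, lp in zip(pi, log_p))  # cross-entropy toward pi
    l2_penalty = c * sum(w * w for w in weights)            # weight regularization
    return value_loss + policy_loss + l2_penalty


pi = search_probabilities([30, 10, 0])  # [0.75, 0.25, 0.0]
loss = alphago_zero_loss([0.0, 0.0, 0.0], value=0.0, pi=pi, z=1.0, weights=[])
print(round(loss, 4))  # 1 + ln 3, printed as 2.0986
```

With uniform logits the policy term is ln 3 regardless of π, and a value prediction of 0 for a won game contributes a squared error of 1. Lowering the temperature sharpens π toward the most-visited move.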
AlphaGo Zero's training process generates millions of self-play games (4.9 million in the initial three-day run), and every position in those games yields a training example for the neural network. The resulting program is evaluated against previous versions of AlphaGo and other Go programs, and it outperforms all of them. The results show that a pure reinforcement learning approach can reach superhuman performance in a domain as challenging as Go without human data or guidance. This success highlights the potential of reinforcement learning in artificial intelligence, particularly in domains where human expertise is expensive, unreliable, or unavailable. AlphaGo Zero rediscovered much of the Go knowledge that humans accumulated over centuries and also found novel strategies, underscoring the power of self-play learning for complex tasks.
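One way to turn a finished self-play game into training examples is sketched below, assuming each recorded position keeps its MCTS search probabilities and the side to move. As the paper describes, the outcome target z is the final result seen from the perspective of the player to move at that position: +1 if that player went on to win, −1 otherwise. `TrainingExample` and `label_game` are hypothetical names, and board states are plain string placeholders rather than a real board encoding.

```python
from dataclasses import dataclass


@dataclass
class TrainingExample:
    state: str       # placeholder for an encoded board position
    pi: list[float]  # MCTS search probabilities recorded at this position
    z: float         # +1 if the player to move here won the game, -1 otherwise


def label_game(states, search_probs, players_to_move, winner):
    """Attach the final outcome z to every position of one finished self-play game."""
    examples = []
    for state, pi, player in zip(states, search_probs, players_to_move):
        z = 1.0 if player == winner else -1.0  # outcome from the mover's perspective
        examples.append(TrainingExample(state, pi, z))
    return examples


game = label_game(
    states=["s0", "s1", "s2"],
    search_probs=[[0.7, 0.3], [0.5, 0.5], [1.0, 0.0]],
    players_to_move=["black", "white", "black"],
    winner="white",
)
print([ex.z for ex in game])  # [-1.0, 1.0, -1.0]
```

Because the same game contributes opposite-signed targets to alternating positions, the single value output learns to score positions for whichever side is about to move.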