Bandit Based Monte-Carlo Planning

2006 | Levente Kocsis and Csaba Szepesvári
This paper introduces UCT (UCB applied to trees), a Monte-Carlo planning algorithm that applies bandit ideas to guide selective sampling in rollout-based planning. The algorithm is designed for large state-space Markovian Decision Problems (MDPs); it is shown to be consistent and comes with finite-sample bounds on the estimation error. Experimental results demonstrate that UCT is significantly more efficient than its alternatives in several domains.

UCT builds on the UCB1 (Upper Confidence Bounds) bandit algorithm: action selection at each internal node of the rollout tree is treated as a separate multi-armed bandit problem, and UCB1 is used to balance exploration and exploitation so that sampling effort concentrates on the most promising actions while every action continues to be tried. The algorithm applies both to MDPs and to game-tree search, where it shows significant performance advantages over other methods.

Theoretical analysis establishes that UCT is consistent: the probability of selecting the optimal action at the root converges to 1 as the number of samples increases.

The algorithm is evaluated on two synthetic domains: random (P-game) trees and a stochastic shortest-path problem (the sailing domain). In the P-game experiments, UCT converges at a rate of order $B^{D/2}$, comparable to alpha-beta search, where $B$ is the branching factor and $D$ the tree depth. In the sailing domain, UCT requires significantly fewer samples than the competing algorithms to reach the same error level, allowing it to solve larger problems.
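To make the selection rule concrete: at each internal node, UCB1 scores every action $a$ by its current average return plus an exploration bonus, roughly $\bar{X}_a + C\sqrt{\ln N / n_a}$, where $N$ is the number of visits to the node, $n_a$ the number of times $a$ has been tried there, and $C$ an exploration constant. The sketch below is a minimal, illustrative Python implementation of this idea for a generic discounted MDP, not the authors' code: the generative-model interface (`actions`, `step`, `is_terminal`), the constant `c = 1.4`, and the rollout depth are assumptions made for the example, and for simplicity the sketch reuses the first sampled transition along each tree edge, which is exact only for deterministic dynamics (a full UCT implementation for stochastic MDPs re-samples the transition each time an edge is traversed).

```python
import math
import random

class Node:
    """One state node in the UCT search tree."""
    def __init__(self, state, parent=None, action=None, reward=0.0, actions=()):
        self.state = state
        self.parent = parent
        self.action = action          # action taken at the parent to reach this node
        self.reward = reward          # immediate reward observed on that edge
        self.children = []
        self.untried = list(actions)  # legal actions not yet expanded from this node
        self.visits = 0
        self.value = 0.0              # running mean of sampled returns from this node

def ucb1_child(node, c=1.4):
    """UCB1 selection: maximize mean return plus an exploration bonus."""
    return max(node.children,
               key=lambda ch: ch.value + c * math.sqrt(math.log(node.visits) / ch.visits))

def rollout(mdp, state, gamma, depth=50):
    """Estimate a leaf's value by following a uniform random default policy."""
    ret, disc = 0.0, 1.0
    for _ in range(depth):
        if mdp.is_terminal(state):
            break
        state, r = mdp.step(state, random.choice(mdp.actions(state)))
        ret += disc * r
        disc *= gamma
    return ret

def uct_search(mdp, root_state, n_iterations=1000, gamma=1.0):
    """Run UCT from root_state and return the recommended action."""
    root = Node(root_state, actions=mdp.actions(root_state))
    for _ in range(n_iterations):
        node, path, rewards = root, [root], []
        # 1. Selection: descend with UCB1 while nodes are fully expanded.
        while not node.untried and node.children:
            node = ucb1_child(node)
            path.append(node)
            rewards.append(node.reward)
        # 2. Expansion: add one previously untried action as a new child.
        if node.untried and not mdp.is_terminal(node.state):
            a = node.untried.pop()
            next_state, r = mdp.step(node.state, a)
            node = Node(next_state, parent=node, action=a, reward=r,
                        actions=mdp.actions(next_state))
            node.parent.children.append(node)
            path.append(node)
            rewards.append(r)
        # 3. Simulation: complete the episode with a random rollout from the leaf.
        ret = rollout(mdp, node.state, gamma)
        # 4. Backpropagation: each node averages the discounted return
        #    observed from that node onward.
        for i in range(len(path) - 1, -1, -1):
            n = path[i]
            n.visits += 1
            n.value += (ret - n.value) / n.visits
            if i > 0:
                ret = rewards[i - 1] + gamma * ret
    # Recommend the most-visited action at the root.
    return max(root.children, key=lambda ch: ch.visits).action
```

Returning the most-visited root action is one common convention for the final recommendation, and the exploration constant would need tuning per domain; both are choices made for this sketch rather than prescriptions from the paper.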
The paper also discusses related research, in particular [10], which proposed rollout-based Monte-Carlo planning with selective action sampling and compared three selection strategies: uniform sampling, Boltzmann exploration, and interval estimation. The results show that UCT outperforms these strategies in both performance and efficiency. The authors conclude that UCT is a promising algorithm for Monte-Carlo planning in large state-space MDPs, with potential applications in real-world game programs. Future work includes analyzing UCT in stochastic shortest-path problems and studying the effect of randomized terminating conditions.