Variance-Aware Prior-Based Tree Policies for Monte Carlo Tree Search
By: Maximilian Weichart
Potential Business Impact:
Makes smart game players learn faster and better.
Monte Carlo Tree Search (MCTS) has profoundly influenced reinforcement learning (RL) by integrating planning and learning in tasks requiring long-horizon reasoning, exemplified by the AlphaZero family of algorithms. Central to MCTS is the search strategy, governed by a tree policy based on an upper confidence bound (UCB) applied to trees (UCT). A key factor in the success of AlphaZero is the introduction of a prior term in the UCB1-based tree policy PUCT, which improves exploration efficiency and thus accelerates training. While many alternative UCBs with stronger theoretical guarantees than UCB1 exist, extending them to prior-based UCTs has been challenging, since PUCT was derived empirically rather than from first principles. Recent work retrospectively justified PUCT by framing MCTS as a regularized policy optimization (RPO) problem. Building on this perspective, we introduce Inverse-RPO, a general methodology that systematically derives prior-based UCTs from any prior-free UCB. Applying this method to the variance-aware UCB-V, we obtain two new prior-based tree policies that incorporate variance estimates into the search. Experiments indicate that these variance-aware prior-based UCTs outperform PUCT across multiple benchmarks without incurring additional computational cost. We also provide an extension of the mctx library supporting variance-aware UCTs, showing that the required code changes are minimal; the extension is intended to facilitate further research on principled prior-based UCTs. Code: github.com/Max-We/inverse-rpo.
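For context, the tree-policy scores named in the abstract have the following standard forms from the literature; these are background formulas, not the paper's derived variance-aware prior-based policies, and the notation is assumed here: $Q(s,a)$ is the empirical action value, $N(s)$ and $N(s,a)$ are parent and child visit counts, $P(s,a)$ is the network prior, $\hat\sigma^2(s,a)$ is the empirical return variance, $b$ is the reward range, and $c$ is an exploration constant.

$$\text{UCB1: } Q(s,a) + c\sqrt{\frac{\ln N(s)}{N(s,a)}} \qquad\quad \text{PUCT: } Q(s,a) + c\,P(s,a)\,\frac{\sqrt{N(s)}}{1 + N(s,a)}$$

$$\text{UCB-V (up to constants): } Q(s,a) + \sqrt{\frac{2\,\hat\sigma^2(s,a)\,\ln N(s)}{N(s,a)}} + \frac{3\,b\,\ln N(s)}{N(s,a)}$$

Inverse-RPO, as described above, starts from a prior-free bound such as UCB-V and derives a corresponding prior-based tree policy; the derived forms themselves are given in the paper.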
Similar Papers
Investigating Intra-Abstraction Policies For Non-exact Abstraction Algorithms
Artificial Intelligence
Teaches computers to make better choices faster.
Tree-OPO: Off-policy Monte Carlo Tree-Guided Advantage Optimization for Multistep Reasoning
Artificial Intelligence
Teaches computers to learn better from choices.
Tail-Risk-Safe Monte Carlo Tree Search under PAC-Level Guarantees
Machine Learning (CS)
Makes smart programs avoid dangerous bad choices.