Regret-Guided Search Control for Efficient Learning in AlphaZero

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the low sample efficiency and heavy reliance on extensive self-play data in AlphaZero-like algorithms by introducing a regret-guided search control mechanism. The proposed approach incorporates a regret estimation network to quantify the learning potential of game states and employs a priority regret buffer to preferentially replay high-regret states—those exhibiting large discrepancies between policy predictions and actual outcomes—as new search roots. This method represents the first integration of regret-driven prioritized experience replay into the Monte Carlo Tree Search (MCTS) and self-play reinforcement learning framework. Evaluated on 9×9 Go, 10×10 Othello, and 11×11 Hex, the approach achieves average Elo gains of 77–89 points over AlphaZero and Go-Exploit, and improves the win rate against KataGo on 9×9 Go from 69.3% to 78.2%.
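The priority regret buffer described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the class name, the capacity and eviction rule, and the regret proxy (absolute gap between the value prediction and the actual outcome, standing in for the learned regret network) are all assumptions made for illustration.

```python
import random

class PrioritizedRegretBuffer:
    """Illustrative sketch of a prioritized regret buffer: states are stored
    with a regret score and sampled with probability proportional to that
    score, so high-regret states are preferentially reused as search roots."""

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.items = []  # list of (state, regret) pairs

    def add(self, state, value_pred, outcome):
        # Regret proxy: gap between the agent's value estimate and the
        # actual game outcome (both assumed to lie in [-1, 1]). The paper
        # instead learns this quantity with a regret network.
        regret = abs(value_pred - outcome)
        self.items.append((state, regret))
        if len(self.items) > self.capacity:
            # Evict the lowest-regret entry to keep the buffer bounded.
            self.items.remove(min(self.items, key=lambda x: x[1]))

    def sample_root(self):
        # Sample a stored state with probability proportional to its regret,
        # to be used as the root of the next self-play search.
        states, regrets = zip(*self.items)
        return random.choices(states, weights=regrets, k=1)[0]
```

In use, states collected from self-play trajectories and MCTS nodes would be `add`ed as games finish, and `sample_root` would supply starting positions for subsequent self-play games in place of the fixed initial position.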

📝 Abstract
Reinforcement learning (RL) agents achieve remarkable performance but remain far less learning-efficient than humans. While RL agents require extensive self-play games to extract useful signals, humans often need only a few games, improving rapidly by repeatedly revisiting states where mistakes occurred. This idea, known as search control, aims to restart from valuable states rather than always from the initial state. In AlphaZero, prior work Go-Exploit applies this idea by sampling past states from self-play or search trees, but it treats all states equally, regardless of their learning potential. We propose Regret-Guided Search Control (RGSC), which extends AlphaZero with a regret network that learns to identify high-regret states, where the agent's evaluation diverges most from the actual outcome. These states are collected from both self-play trajectories and MCTS nodes, stored in a prioritized regret buffer, and reused as new starting positions. Across 9x9 Go, 10x10 Othello, and 11x11 Hex, RGSC outperforms AlphaZero and Go-Exploit by an average of 77 and 89 Elo, respectively. When training on a well-trained 9x9 Go model, RGSC further improves the win rate against KataGo from 69.3% to 78.2%, while both baselines show no improvement. These results demonstrate that RGSC provides an effective mechanism for search control, improving both efficiency and robustness of AlphaZero training. Our code is available at https://rlg.iis.sinica.edu.tw/papers/rgsc.
Problem

Research questions and friction points this paper is trying to address.

search control
learning efficiency
reinforcement learning
AlphaZero
regret
Innovation

Methods, ideas, or system contributions that make the work stand out.

Regret-Guided Search Control
AlphaZero
search control
regret network
prioritized replay
Yun-Jui Tsai
Department of Computer Science, National Yang Ming Chiao Tung University, Taiwan; Institute of Information Science, Academia Sinica, Taiwan
Wei-Yu Chen
Carnegie Mellon University, Apple
computational photography, AR/VR, optics, computer vision, deep learning
Yan-Ru Ju
Institute of Information Science, Academia Sinica, Taiwan
Yu-Hung Chang
Institute of Information Science, Academia Sinica, Taiwan
Ti-Rong Wu
Institute of Information Science, Academia Sinica
Reinforcement learning, Planning, Computer games, Deep learning, Artificial intelligence