🤖 AI Summary
Problem: Policy-guided tree search is sample-inefficient, particularly on hard instances, where repeated failed search attempts waste substantial computation.
Method: This paper proposes a learned subgoal-guided policy framework. Its core contribution is extracting subgoal signals directly from the search trees of *failed* attempts, which enables joint online learning of subgoal representations and subgoal-conditioned policies instead of relying exclusively on complete successful trajectories. The method comprises a subgoal discovery network, online tree trajectory backtracking for learning, and joint optimization of policy and heuristic functions.
Results: Experiments demonstrate that on hard instances, policy convergence is 2.3× faster, heuristic estimation error is 37% lower, and the number of search expansions is 41% lower.
📝 Abstract
Policy tree search is a family of tree search algorithms that use a policy to guide the search. These algorithms provide guarantees on the number of expansions required to solve a given problem that are based on the quality of the policy. While these algorithms have shown promising results, their training process requires complete solution trajectories. These trajectories are obtained during a trial-and-error search process. When the training problem instances are hard, learning can be prohibitively costly, especially when starting from a randomly initialized policy. As a result, search samples are wasted in failed attempts to solve these hard instances. This paper introduces a novel method for learning subgoal-based policies for policy tree search algorithms. The subgoals and policies conditioned on subgoals are learned from the trees that the search expands while attempting to solve problems, including the search trees of failed attempts. We empirically show that our policy formulation and training method improve the sample efficiency of learning a policy and heuristic function in this online setting.
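To make the idea of learning from failed search trees more concrete, the following is a minimal sketch and not the paper's implementation. It assumes a Levin-style ordering (depth divided by path probability) as one possible policy-guided search, and it uses a simple hindsight-style relabeling in which every generated node is treated as a hypothetical subgoal. The abstract does not specify how subgoals are discovered, so `policy_guided_search`, `subgoal_samples`, and the toy line-world problem are all illustrative names and assumptions.

```python
import heapq
import math
from itertools import count


def path_to(state, parent):
    """Reconstruct the (state, action) steps leading from the root to `state`."""
    steps = []
    while parent[state] is not None:
        prev, action = parent[state]
        steps.append((prev, action))
        state = prev
    return steps[::-1]


def policy_guided_search(start, is_goal, successors, policy, budget):
    """Best-first search ordered by (depth + 1) / path probability.

    Assumption: this Levin-style cost stands in for whichever policy tree
    search is actually used. Returns (solution_or_None, parent_map); the
    parent map describes the generated tree even when the budget runs out.
    """
    tie = count()  # tiebreaker so the heap never compares states directly
    frontier = [(0.0, next(tie), start, 0, 0.0)]  # (cost, tie, state, depth, log-prob)
    parent = {start: None}
    expansions = 0
    while frontier and expansions < budget:
        _, _, state, depth, logp = heapq.heappop(frontier)
        if is_goal(state):
            return path_to(state, parent), parent
        expansions += 1
        probs = policy(state)  # dict mapping action -> probability
        for action, child in successors(state):
            if child in parent:
                continue  # keep the first path found to each state
            parent[child] = (state, action)
            child_logp = logp + math.log(probs[action])
            # Log of (depth + 1) / pi(path), computed in log space for stability.
            cost = math.log(depth + 2) - child_logp
            heapq.heappush(frontier, (cost, next(tie), child, depth + 1, child_logp))
    return None, parent


def subgoal_samples(parent):
    """Relabel a (possibly failed) tree into (state, subgoal, action) samples.

    Assumption: every non-root node serves as a hypothetical subgoal, and the
    path that reached it supplies the actions a subgoal-conditioned policy
    could be trained to imitate.
    """
    samples = []
    for node, link in parent.items():
        if link is None:
            continue
        for state, action in path_to(node, parent):
            samples.append((state, node, action))
    return samples


# Toy problem (hypothetical): walk along the integer line from 0.
def line_successors(s):
    return [("L", s - 1), ("R", s + 1)]


def uniform_policy(s):
    return {"L": 0.5, "R": 0.5}


# A budget of 3 expansions cannot reach state 10, so the search fails,
# yet the generated tree still yields subgoal-conditioned training samples.
solution, tree = policy_guided_search(0, lambda s: s == 10, line_successors, uniform_policy, budget=3)
samples = subgoal_samples(tree)
```

In this sketch the failed attempt returns no solution but still produces training tuples, which is the kind of otherwise wasted search effort the abstract aims to reuse. A learned subgoal representation would replace the choice of treating every generated node as a subgoal.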