Subgoal-Guided Policy Heuristic Search with Learned Subgoals

📅 2025-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Policy-guided tree search suffers from low sample efficiency, particularly on hard instances where repeated failed search attempts waste substantial computation. Method: the paper proposes a learning-based subgoal-guided policy framework. Its core innovation is extracting effective subgoal signals directly from *failed* search trees, enabling joint online learning of subgoal representations and subgoal-conditioned policies and breaking from conventional paradigms that rely exclusively on complete successful trajectories. The method comprises a subgoal discovery network, online backtracking over tree trajectories for learning, and joint optimization of the policy and heuristic functions. Results: on hard instances, policy convergence accelerates by 2.3×, heuristic estimation error decreases by 37%, and the number of search expansions drops by 41%.
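The idea of mining failed searches can be sketched in a toy routine. Everything here is illustrative, not from the paper: the name `extract_subgoal_candidates` and the heuristic of treating the deepest frontier states of a failed search tree as candidate subgoals are assumptions standing in for the paper's subgoal discovery network.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A node of an (assumed) search tree left over from a failed attempt."""
    state: tuple
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    depth: int = 0

def extract_subgoal_candidates(root, k=3):
    """Collect the k deepest frontier states of a failed search tree.

    Hypothetical heuristic: the states the search pushed farthest toward
    are kept as candidate subgoals for conditioning the policy.
    """
    frontier = []
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.children:          # leaf = frontier of the failed search
            frontier.append(node)
        stack.extend(node.children)
    frontier.sort(key=lambda n: n.depth, reverse=True)
    return [n.state for n in frontier[:k]]
```

The point of the sketch is only that a failed tree still carries training signal: its frontier marks how far the current policy got, so those states can serve as subgoals even though no complete solution trajectory exists.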

📝 Abstract
Policy tree search is a family of tree search algorithms that use a policy to guide the search. These algorithms provide guarantees on the number of expansions required to solve a given problem that are based on the quality of the policy. While these algorithms have shown promising results, the process in which they are trained requires complete solution trajectories to train the policy. Search trajectories are obtained during a trial-and-error search process. When the training problem instances are hard, learning can be prohibitively costly, especially when starting from a randomly initialized policy. As a result, search samples are wasted in failed attempts to solve these hard instances. This paper introduces a novel method for learning subgoal-based policies for policy tree search algorithms. The subgoals and policies conditioned on subgoals are learned from the trees that the search expands while attempting to solve problems, including the search trees of failed attempts. We empirically show that our policy formulation and training method improve the sample efficiency of learning a policy and heuristic function in this online setting.
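For context on the family of algorithms the abstract refers to, here is a minimal sketch of policy-guided best-first search with a simplified Levin-style priority (d(n)+1)/π(n), where π(n) is the product of policy probabilities along the path to n. This is an illustration under assumed interfaces (a `policy(state)` callable returning successor probabilities), not the paper's exact algorithm.

```python
import heapq
import itertools
import math

def policy_tree_search(start, goal, policy, max_expansions=1000):
    """Best-first search ordered by a Levin-style cost (d(n)+1)/pi(n).

    `policy(state)` maps successor states to probabilities (an assumed
    interface); pi(n) is the product of those probabilities along the
    path from the root, tracked as -log pi to avoid underflow.
    Returns (solution_depth, expansions) or None on failure.
    """
    tie = itertools.count()                         # heap tie-breaker
    open_list = [(1.0, next(tie), start, 0, 0.0)]   # (priority, tie, state, d, -log pi)
    seen = {start}
    expansions = 0
    while open_list and expansions < max_expansions:
        _, _, state, depth, nlp = heapq.heappop(open_list)
        if state == goal:
            return depth, expansions
        expansions += 1
        for nxt, p in policy(state).items():
            if nxt in seen or p <= 0.0:
                continue
            seen.add(nxt)
            child_nlp = nlp - math.log(p)
            priority = (depth + 2) * math.exp(child_nlp)  # (d+1)/pi at the child
            heapq.heappush(open_list, (priority, next(tie), nxt, depth + 1, child_nlp))
    return None
```

The sketch shows why policy quality drives the expansion guarantees the abstract mentions: a policy that assigns high probability to solution-path actions keeps those nodes' costs low, so they are expanded early and the search solves the problem with few expansions.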
Problem

Research questions and friction points this paper is trying to address.

Improves policy tree search sample efficiency
Learns subgoals from failed search attempts
Reduces training cost for hard instances
Innovation

Methods, ideas, or system contributions that make the work stand out.

Subgoal-guided policy heuristic search
Learned subgoals from expanded trees
Improved sample efficiency online
Jake E. Tuero
Department of Computing Science, University of Alberta, Edmonton, Canada; Alberta Machine Intelligence Institute (Amii), Edmonton, Canada
Michael Buro
Professor of Computing Science, University of Alberta
Heuristic Search · Machine Learning · Planning
Levi H. S. Lelis
Department of Computing Science, University of Alberta, Edmonton, Canada; Alberta Machine Intelligence Institute (Amii), Edmonton, Canada