🤖 AI Summary
This work addresses the high computational overhead and limited scalability of subgoal-based policy tree search in complex deterministic single-agent tasks, both of which stem from explicit subgoal generation. The authors propose a learned “rerooter” used through the √LTS algorithm, which implicitly decomposes the problem into soft subtasks. This removes the need to explicitly reconstruct and reason over generated subgoals and allows search effort to be allocated at lower computational cost. Three rerooter variants are introduced: one that exploits global state-space structure via clustering, one that uses learned cost-to-go estimates as a heuristic, and a hybrid that combines both signals. Together they move beyond the given or handcrafted rerooters assumed in prior work. Experiments show that the rerooting-based methods scale to complex environments where subgoal-based policy tree search fails and achieve state-of-the-art online training efficiency on the domains tested.
📝 Abstract
Subgoal-based policy tree search, which uses a policy to guide search, is effective for complex single-agent deterministic problems but often relies on explicit subgoal generation, which can incur substantial overhead and hinder scalability. In this paper, we overcome these limitations by using a learned ``rerooter'' through the recently introduced $\sqrt{\text{LTS}}$ algorithm. A rerooter implicitly decomposes the problem into soft subtasks. While previous work focused on formal guarantees for given or handcrafted rerooters, in this work we propose three rerooter designs: (i) a clustering-based rerooter that exploits global state-space structure, (ii) a heuristic-based rerooter that leverages learned cost-to-go estimates, and (iii) a hybrid that combines both signals. Our framework avoids having to explicitly reconstruct and reason over generated subgoals, thereby enabling scalable allocation of search effort with significantly lower computational overhead. Empirically, our rerooting-based methods scale to complex environments where subgoal-based policy tree search fails, and achieve state-of-the-art online training efficiency on the domains tested.