GHOST: Hierarchical Sub-Goal Policies for Generalizing Robot Manipulation

📅 2026-06-08
🤖 AI Summary
This work targets the limited out-of-distribution generalization of robotic manipulation policies with a hierarchical policy architecture. A high-level policy predicts the next sub-goal as a distribution over 3D end-effector poses from multi-view RGB-D observations; the predicted goals are projected into the image plane as end-effector heatmaps that condition a low-level, goal-conditioned controller. Because sub-goals are largely embodiment-agnostic, this interface avoids noisy action retargeting: the high-level policy can be trained on human video while the low-level policy is trained only on robot data. Across a suite of manipulation tasks, the hierarchy consistently improves performance and robustness over a flat Diffusion Policy baseline, and it adapts to novel objects and task variations from a small number of human demonstrations.
📝 Abstract
We present GHOST, a framework for learning visuomotor manipulation policies that generalize beyond the training distribution. GHOST factorizes control into (i) a high-level policy that predicts the next sub-goal as a distribution over 3D end-effector poses from multi-view RGB-D observations, and (ii) a low-level goal-conditioned controller that executes embodiment-specific actions. To condition image-based policies on 3D goals, we introduce a simple spatial interface that projects predicted goals into the image plane and represents them as end-effector heatmaps. Across a suite of manipulation tasks, this hierarchical factorization consistently improves performance and robustness compared to a flat Diffusion Policy. Further, we show that this hierarchical interface also makes it easy to incorporate human demonstrations without relying on (noisy) action retargeting. As sub-goals are largely embodiment-agnostic, we train the high-level policy on human video to specify how learned skills should be applied and composed, while keeping the low-level policy trained purely on robot data. This hierarchy enables adaptation to novel objects and task variations using a small number of human demonstrations.
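The spatial interface described in the abstract (project a predicted 3D goal into the image plane, then represent it as a heatmap) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single goal position already expressed in the camera frame, a standard pinhole camera model, and an isotropic Gaussian blob as the heatmap. The function names, the `sigma` parameter, and the intrinsics values are hypothetical, and goal orientation and the multi-view case are ignored.

```python
import math

def project_point(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (camera frame) to pixel coordinates (u, v).

    Assumption: the point is already in the camera frame with +z pointing forward.
    """
    x, y, z = point_cam
    if z <= 0:
        return None  # behind the camera: no valid projection
    return (fx * x / z + cx, fy * y / z + cy)

def gaussian_heatmap(center, width, height, sigma=2.0):
    """Render an unnormalized Gaussian centred at (u, v); the peak value is 1.0."""
    u0, v0 = center
    return [
        [math.exp(-((u - u0) ** 2 + (v - v0) ** 2) / (2 * sigma ** 2))
         for u in range(width)]
        for v in range(height)
    ]

def goal_heatmap(point_cam, intrinsics, width, height, sigma=2.0):
    """Project a 3D goal and return a height x width heatmap (all zeros if not visible)."""
    uv = project_point(point_cam, *intrinsics)
    if uv is None:
        return [[0.0] * width for _ in range(height)]
    return gaussian_heatmap(uv, width, height, sigma)

# Hypothetical intrinsics (fx, fy, cx, cy) and a goal 0.5 units in front of the camera.
intrinsics = (80.0, 80.0, 32.0, 24.0)
heatmap = goal_heatmap((0.125, -0.0625, 0.5), intrinsics, width=64, height=48)
```

In a setup like this, one such heatmap per camera view could be stacked with the RGB channels as an extra input to an image-based low-level policy. Whether GHOST uses exactly this encoding is not stated in the abstract.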
Problem

Research questions and friction points this paper is trying to address.

generalization
robot manipulation
sub-goal policies
human demonstrations
task variation
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical policy
sub-goal
visuomotor manipulation
3D goal conditioning
human demonstration transfer