🤖 AI Summary
This work addresses the mismatch between high-level task semantics and the low-level commands that whole-body controllers expect, which hinders real-world humanoid deployment: existing controllers typically require dense kinematic or spatial references that planners struggle to synthesize. The authors propose HANDOFF, a single whole-body controller built on a compact, explicit command interface. Paired with a VLM-driven agentic planner, it executes natural-language instructions without task-specific data or controller fine-tuning. HANDOFF is trained with context-gated multi-teacher KL distillation, which distills three specialist policies (whole-body motion tracking trained on safety-filtered data, locomotion, and fall recovery) into a mixture-of-experts student. On the Unitree G1, it matches state-of-the-art velocity tracking and offers one of the largest robust manipulation workspaces, and multiple natural-language-driven tasks are demonstrated on real hardware.
📝 Abstract
For a humanoid robot to be deployed in the real world, the choice of command space (i.e., the interface between task planning and whole-body control) is crucial. Existing whole-body controllers typically demand dense kinematic or spatial references that planners struggle to synthesize from task semantics. We instead propose a compact, explicit interface that is intuitive, general, modular, and expressive enough for diverse manipulation skills. To this end, we introduce HANDOFF, a single humanoid whole-body controller that follows this interface. It is a mixture-of-experts student distilled, via multi-teacher KL distillation under a context-conditioned gating scheme, from three complementary specialists: whole-body motion tracking with safety-filtered data, locomotion, and fall recovery. On the Unitree G1, HANDOFF matches state-of-the-art velocity tracking and offers one of the largest robust manipulation workspaces. We further demonstrate hardware feasibility through multiple natural-language-driven task roll-outs, powered by a VLM-driven agentic planner with no task-specific data or controller fine-tuning.
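To make the idea of context-gated multi-teacher KL distillation concrete, here is a minimal sketch of what such a loss could look like. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the policy distribution, the KL direction, or the form of the gate. The sketch assumes diagonal-Gaussian action distributions, the direction KL(teacher || student), and a per-sample gate given as a dictionary of teacher weights. The names `gaussian_kl`, `gated_distillation_loss`, and the teacher keys are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Assumed representation: (mean, std) per action dimension of a diagonal Gaussian.
Gaussian = Tuple[Sequence[float], Sequence[float]]


def gaussian_kl(p: Gaussian, q: Gaussian) -> float:
    """KL(p || q) between diagonal Gaussians, summed over action dimensions."""
    (mu_p, std_p), (mu_q, std_q) = p, q
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        total += math.log(sq / sp) + (sp**2 + (mp - mq) ** 2) / (2.0 * sq**2) - 0.5
    return total


def gated_distillation_loss(
    student: List[Gaussian],
    teachers: Dict[str, List[Gaussian]],
    gates: List[Dict[str, float]],
) -> float:
    """Batch mean of sum_k gate[k] * KL(teacher_k || student).

    gates[i] maps teacher name -> weight for sample i (assumed to come from
    the sample's context, e.g. "the robot has fallen" -> fall_recovery).
    """
    loss = 0.0
    for i, (student_dist, gate) in enumerate(zip(student, gates)):
        for name, weight in gate.items():
            if weight > 0.0:  # teachers gated off contribute nothing
                loss += weight * gaussian_kl(teachers[name][i], student_dist)
    return loss / len(student)


# Toy batch of two samples with 2-D actions (all numbers are made up).
unit = ([0.0, 0.0], [1.0, 1.0])
teachers = {
    "motion_tracking": [unit, unit],
    "locomotion": [([1.0, 0.0], [1.0, 1.0]), unit],
    "fall_recovery": [unit, ([0.0, 2.0], [1.0, 1.0])],
}
student = [unit, unit]
# Hard gate: sample 0 is a locomotion context, sample 1 a fall-recovery context.
gates = [{"locomotion": 1.0}, {"fall_recovery": 1.0}]

loss = gated_distillation_loss(student, teachers, gates)
print(loss)
```

With a hard gate, each sample is supervised by exactly one specialist, so the student is never pulled toward a teacher acting outside its own regime. A soft gate with fractional weights fits the same function signature.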