AI Summary
Existing World Action Models (WAMs) struggle to deliver real-time whole-body coordination for humanoid robots because iterative denoising in high-dimensional video-action latent spaces is computationally expensive. Conventional hierarchical policies add to the problem by decoupling the upper- and lower-body action spaces, which prevents the legs from taking part in task interaction. This work proposes MotionWAM, which abandons the upper-lower body separation and instead introduces a unified whole-body motion latent space that jointly models locomotion, torso adjustment, foot interaction, and hand manipulation. By conditioning the policy on intermediate denoising features from a video world model and employing a three-stage progressive learning framework, MotionWAM achieves end-to-end real-time control using only monocular first-person visual input. Evaluated on nine real-world Unitree G1 tasks, the system operates in real time, surpasses fine-tuned Vision-Language-Action (VLA) baselines by over 30% in success rate, and demonstrates, for the first time, task-driven foot interaction.
Abstract
World Action Models (WAMs) couple a video dynamics prior to the policy and have shown encouraging results on tabletop manipulation, but iterative denoising over high-dimensional video-action latents leaves them too slow for real-time humanoid loco-manipulation. The problem is compounded by the dominant hierarchical paradigm, in which a high-level manipulation policy controls only the upper body while a low-level controller tracks coarse base commands. This places the upper and lower body in inconsistent action spaces and reduces the legs to balance-preserving locomotion. We present MotionWAM, a real-time WAM that drives autonomous humanoid loco-manipulation from a single egocentric camera by conditioning the policy on the intermediate denoising features of a video world model. MotionWAM replaces the upper-lower split with a unified motion latent and predicts whole-body motion tokens that jointly cover locomotion, torso motion, height regulation, foot interaction, and hand manipulation in a single action space. A three-stage learning framework progressively adapts the video world model to egocentric visual dynamics and to the target humanoid embodiment. On nine real-world Unitree G1 tasks, MotionWAM runs in real time, outperforms Vision-Language-Action (VLA) baselines fine-tuned on the same demonstrations by over 30% in overall success rate, and executes task-driven foot interaction that decoupled upper-lower policies cannot reach. Our results suggest that video-pretrained WAMs can be lifted from tabletop manipulation to coordinated, human-like whole-body humanoid control.
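The idea of conditioning the policy on intermediate denoising features, rather than waiting for the full denoising schedule, can be illustrated with a toy sketch. This is not the MotionWAM implementation. Every name, dimension, and step count below (`denoise_step`, `world_model_features`, `policy_head`, `FEATURE_STEPS`, and so on) is a hypothetical stand-in, and the "world model" is a trivial blend of noise and observation. The only point shown is the control flow: stop denoising early, read the features, and map them to a single whole-body action vector.

```python
import random

# All constants are illustrative assumptions, not values from the paper.
FULL_DENOISE_STEPS = 50  # hypothetical length of a full denoising schedule
FEATURE_STEPS = 1        # hypothetical number of steps run before reading features
LATENT_DIM = 8           # toy latent size
ACTION_DIM = 5           # toy slots: locomotion, torso, height, foot, hand


def denoise_step(latent, obs):
    """Toy denoiser: pull the noisy latent halfway toward the observation."""
    return [0.5 * z + 0.5 * o for z, o in zip(latent, obs)]


def world_model_features(obs, num_steps, rng):
    """Run only `num_steps` denoising steps and return the intermediate latent."""
    latent = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    for _ in range(num_steps):
        latent = denoise_step(latent, obs)
    return latent


def policy_head(features, weights):
    """Linear map from world-model features to one whole-body action vector."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


rng = random.Random(0)
# Stand-in for an encoded egocentric camera frame.
obs = [rng.uniform(-1.0, 1.0) for _ in range(LATENT_DIM)]
# Stand-in for learned policy weights.
weights = [[rng.uniform(-0.1, 0.1) for _ in range(LATENT_DIM)]
           for _ in range(ACTION_DIM)]

features = world_model_features(obs, FEATURE_STEPS, rng)
action = policy_head(features, weights)  # one vector covering the whole body
```

In this toy setup the policy pays for `FEATURE_STEPS` denoiser calls per control tick instead of `FULL_DENOISE_STEPS`, which is the kind of saving that makes real-time control plausible. The single `action` vector mirrors the unified action space described above, in contrast to separate upper-body and base commands.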