GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the scarcity of robot-compatible demonstration data, a key bottleneck in scaling humanoid loco-manipulation: teleoperation and motion capture are difficult to scale because each collection depends on physical setups. The authors propose GRAIL, a fully virtual generation pipeline that synthesizes large-scale 4D human-object interaction (HOI) sequences from 3D assets, simulator-ready scenes, and priors from video foundation models, without rebuilding physical environments or teleoperating the robot. Because object geometry, camera parameters, metric scale, and environment depth are known before video generation and reused during reconstruction, the method reduces depth ambiguity and morphology mismatch. It combines model-based object tracking, human motion estimation, and interaction-aware optimization to recover metric trajectories, which are then retargeted to a humanoid robot. From over 20,000 generated sequences, policies trained only on synthetic data reach 84% real-world success on object pick-up and 90% on stair-climbing on a Unitree G1 humanoid.

📝 Abstract

Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body motions, and scene geometries, but teleoperation and motion capture are difficult to scale because each collection depends on physical setups, instrumented actors, and robot operation. We present GRAIL, a digital generation pipeline that remains fully virtual until deployment: it composes 3D assets, simulator-ready scenes, and priors from video foundation models (VFMs) to synthesize interactions without rebuilding physical environments or teleoperating the robot. Rather than reconstructing unconstrained in-the-wild videos, GRAIL starts from fully specified 3D configurations in which object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation and reused during reconstruction. This privileged setup better conditions 4D recovery, allowing model-based object tracking, human motion estimation, and interaction-aware optimization to reconstruct metric 4D human-object interaction (HOI) trajectories with reduced depth ambiguity and morphology mismatch. We retarget the recovered motions to a humanoid robot and train complementary task-general trackers: an object-aware latent adaptor for manipulation and a scene-aware tracker for terrain traversal. GRAIL produces over 20,000 sequences spanning pick-up, object manipulation, sitting, and terrain traversal. Using only GRAIL-generated data, we train egocentric visual policies through a sim-to-real pipeline and deploy them on a Unitree G1 humanoid, achieving 84% real-world success on diverse object pick-up and 90% success on stair-climbing.
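
To illustrate why known camera parameters and environment depth reduce depth ambiguity, here is a minimal sketch using the standard pinhole camera model. It is not taken from the GRAIL pipeline: the function names `project` and `backproject` and the intrinsics values are hypothetical, chosen only for illustration. The sketch shows that two 3D points on the same camera ray land on the same pixel, while a pixel paired with a known metric depth maps back to a single 3D point.

```python
# Illustrative pinhole-camera sketch (assumed setup, not the paper's code).

def project(point, fx, fy, cx, cy):
    """Project a 3D camera-frame point (x, y, z) to pixel coordinates (u, v)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a known metric depth to a 3D camera-frame point."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# Hypothetical intrinsics: focal lengths and principal point, in pixels.
K = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)

near = (0.32, 0.24, 2.0)              # a point 2 m from the camera
far = tuple(2.0 * c for c in near)    # a point on the same ray, 4 m away

# Without depth, the image alone cannot tell these two points apart.
pixel_near = project(near, **K)
pixel_far = project(far, **K)

# With known depth, the pixel maps back to a unique metric 3D point.
recovered = backproject(pixel_near[0], pixel_near[1], 2.0, **K)
```

In this toy setting, `pixel_near` and `pixel_far` coincide, whereas `recovered` matches `near`. A pipeline that already knows the camera and scene depth before generating video can use the same relationship to anchor reconstruction at metric scale.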

Problem

Research questions and friction points this paper is trying to address.

humanoid locomotion

manipulation

demonstration generation

human-object interaction

sim-to-real

Innovation

Methods, ideas, or system contributions that make the work stand out.

humanoid locomotion

4D human-object interaction

video foundation models