From Language to Locomotion: Retargeting-free Humanoid Control via Motion Latent Guidance

📅 2025-10-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing language-guided humanoid robot control suffers from multi-stage error accumulation, high latency, and weak semantic–motion coupling. This paper introduces RoboGhost, a framework that maps language to humanoid actions end-to-end, without decoding and retargeting human motion. It achieves tight semantic–motion coupling by grounding commands in a compact, physically grounded motion latent space, and pairs a hybrid causal Transformer-diffusion motion generator with a diffusion-based policy that denoises kinematically feasible, temporally consistent, and behaviorally diverse action sequences directly from noise, conditioned on multimodal inputs (text, images, audio, and music). The model is trained in simulation and transfers zero-shot to real humanoid robots. Experiments demonstrate that RoboGhost reduces inference latency by over 50%, significantly improves task success rates and action tracking accuracy, and enables fluent, semantically aligned multimodal motion control.
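To make the pipeline concrete, here is a minimal sketch (not the authors' code) of the retargeting-free idea summarized above: a language embedding is projected into a compact motion latent, and a policy consumes that latent together with proprioception to output joint actions directly. The module names (MotionLatentEncoder, LatentConditionedPolicy), dimensions, and MLP architectures are assumptions for illustration only.

```python
# Minimal sketch of a retargeting-free language-to-action pipeline.
# All names and sizes are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class MotionLatentEncoder(nn.Module):
    """Maps a pooled text embedding into a compact motion latent."""

    def __init__(self, text_dim: int = 512, latent_dim: int = 64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, 256), nn.GELU(), nn.Linear(256, latent_dim)
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(text_emb)


class LatentConditionedPolicy(nn.Module):
    """Predicts joint-space actions from proprioception conditioned on the motion latent."""

    def __init__(self, obs_dim: int = 48, latent_dim: int = 64, act_dim: int = 23):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 256), nn.ELU(),
            nn.Linear(256, 256), nn.ELU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, latent], dim=-1))


# Toy usage: one control step with a random tensor standing in for an encoded command.
text_emb = torch.randn(1, 512)            # placeholder for a language encoder output
obs = torch.randn(1, 48)                  # placeholder proprioceptive state
latent = MotionLatentEncoder()(text_emb)
action = LatentConditionedPolicy()(obs, latent)
print(action.shape)                        # torch.Size([1, 23])
```

The key point the sketch illustrates is that no human-motion decoding or retargeting stage sits between the latent and the executable action.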

📝 Abstract
Natural language offers a natural interface for humanoid robots, but existing language-guided humanoid locomotion pipelines remain cumbersome and unreliable. They typically decode human motion, retarget it to robot morphology, and then track it with a physics-based controller. However, this multi-stage process is prone to cumulative errors, introduces high latency, and yields weak coupling between semantics and control. These limitations call for a more direct pathway from language to action, one that eliminates fragile intermediate stages. Therefore, we present RoboGhost, a retargeting-free framework that directly conditions humanoid policies on language-grounded motion latents. By bypassing explicit motion decoding and retargeting, RoboGhost enables a diffusion-based policy to denoise executable actions directly from noise, preserving semantic intent and supporting fast, reactive control. A hybrid causal transformer-diffusion motion generator further ensures long-horizon consistency while maintaining stability and diversity, yielding rich latent representations for precise humanoid behavior. Extensive experiments demonstrate that RoboGhost substantially reduces deployment latency, improves success rates and tracking accuracy, and produces smooth, semantically aligned locomotion on real humanoids. Beyond text, the framework naturally extends to other modalities such as images, audio, and music, providing a general foundation for vision-language-action humanoid systems.
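The abstract's central mechanism, a diffusion-based policy that denoises executable actions directly from noise while conditioned on a language-grounded motion latent, can be sketched with a standard DDPM-style reverse process. The noise schedule, step count, EpsNet denoiser, and all dimensions below are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of denoising an action chunk from Gaussian noise, conditioned on
# proprioception and a motion latent. Standard DDPM reverse process; all values assumed.
import torch
import torch.nn as nn

T = 50                                     # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)


class EpsNet(nn.Module):
    """Predicts the noise added to an action chunk, given obs, latent, and step t."""

    def __init__(self, act_dim: int = 23, obs_dim: int = 48, latent_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + obs_dim + latent_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, 256), nn.SiLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, a_t, obs, latent, t):
        t_emb = t.float().view(-1, 1) / T          # normalized timestep embedding
        return self.net(torch.cat([a_t, obs, latent, t_emb], dim=-1))


@torch.no_grad()
def sample_action(eps_net, obs, latent, act_dim: int = 23):
    """Reverse diffusion: start from noise and iteratively denoise to an action."""
    a = torch.randn(obs.shape[0], act_dim)
    for t in reversed(range(T)):
        eps = eps_net(a, obs, latent, torch.full((obs.shape[0],), t))
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        a = (a - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            a = a + torch.sqrt(betas[t]) * torch.randn_like(a)
    return a


action = sample_action(EpsNet(), torch.randn(1, 48), torch.randn(1, 64))
print(action.shape)  # torch.Size([1, 23])
```

Conditioning the denoiser on the motion latent at every step is what ties the sampled action chunk to the semantics of the command.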
Problem

Research questions and friction points this paper is trying to address.

Multi-stage pipelines that decode, retarget, and then track human motion accumulate errors
Indirect mappings from language to robot actions yield weak semantic–motion coupling
High deployment latency prevents fast, reactive language-guided locomotion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct humanoid control via language-grounded motion latents
Diffusion-based policy denoises actions from noise
Hybrid causal transformer-diffusion ensures long-horizon consistency
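The "hybrid causal transformer-diffusion" bullet can be read as: a causal transformer summarizes the history of motion latents into per-step context vectors, and each context vector then conditions a diffusion head such as the denoiser sketched after the abstract. Below is a rough sketch under that assumption; layer sizes and the mask construction are illustrative, not the paper's architecture.

```python
# Causal transformer over a sequence of motion latents; outputs per-step context
# vectors that a diffusion head could condition on. Sizes are assumptions.
import torch
import torch.nn as nn


class CausalMotionContext(nn.Module):
    """Causal self-attention over motion latents; one context vector per step."""

    def __init__(self, latent_dim: int = 64, d_model: int = 128, n_layers: int = 2):
        super().__init__()
        self.inp = nn.Linear(latent_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=256, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (batch, horizon, latent_dim)
        horizon = latents.shape[1]
        # Upper-triangular mask so step t only attends to steps <= t (causality).
        mask = torch.triu(torch.ones(horizon, horizon, dtype=torch.bool), diagonal=1)
        return self.encoder(self.inp(latents), mask=mask)


ctx = CausalMotionContext()(torch.randn(1, 16, 64))
print(ctx.shape)  # torch.Size([1, 16, 128]) -- one conditioning vector per step
```

Restricting attention to past steps lets such a generator run online while still conditioning each step on the full motion history, which is the usual route to long-horizon consistency in causal generators.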
🔎 Similar Papers
No similar papers found.
👥 Authors
Zhe Li (University of Sydney)
Cheng Chi (Columbia University, Stanford University) · robotics
Yangyang Wei (Harbin Institute of Technology)
Boan Zhu (Hong Kong University of Science and Technology)
Yibo Peng (Carnegie Mellon University) · Code Generation, Multimodal NLP, AI Agents
Tao Huang (Shanghai Jiao Tong University)
Pengwei Wang (University of Calgary) · Computer Science Security
Zhongyuan Wang (BAAI)
Shanghang Zhang (Peking University) · Embodied AI, Foundation Models
Chang Xu (University of Sydney)