From Language to Locomotion: Retargeting-free Humanoid Control via Motion Latent Guidance

📅 2025-10-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing language-guided humanoid robot control suffers from multi-stage error accumulation, high latency, and weak semantic–motion coupling. This paper introduces RoboGhost, a framework that maps language to humanoid actions end-to-end, without decoding and retargeting human motion. It achieves tight semantic–motion coupling by grounding commands in a compact, physically grounded motion latent space, and pairs a hybrid causal Transformer-diffusion motion generator with a diffusion-based policy that denoises kinematically feasible, temporally consistent, and behaviorally diverse action sequences directly from noise, conditioned on multimodal inputs (text, images, audio, and music). The model is trained in simulation and transfers zero-shot to real humanoid robots. Experiments demonstrate that RoboGhost reduces inference latency by over 50%, significantly improves task success rates and action tracking accuracy, and enables fluent, semantically aligned multimodal motion control.
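To make the pipeline concrete, here is a minimal sketch (not the authors' code) of the retargeting-free idea summarized above: a language embedding is projected into a compact motion latent, and a policy consumes that latent together with proprioception to output joint actions directly. The module names (MotionLatentEncoder, LatentConditionedPolicy), dimensions, and MLP architectures are assumptions for illustration only.

```python
# Minimal sketch of a retargeting-free language-to-action pipeline.
# All names and sizes are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class MotionLatentEncoder(nn.Module):
    """Maps a pooled text embedding into a compact motion latent."""

    def __init__(self, text_dim: int = 512, latent_dim: int = 64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, 256), nn.GELU(), nn.Linear(256, latent_dim)
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(text_emb)


class LatentConditionedPolicy(nn.Module):
    """Predicts joint-space actions from proprioception conditioned on the motion latent."""

    def __init__(self, obs_dim: int = 48, latent_dim: int = 64, act_dim: int = 23):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 256), nn.ELU(),
            nn.Linear(256, 256), nn.ELU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, latent], dim=-1))


# Toy usage: one control step with a random tensor standing in for an encoded command.
text_emb = torch.randn(1, 512)            # placeholder for a language encoder output
obs = torch.randn(1, 48)                  # placeholder proprioceptive state
latent = MotionLatentEncoder()(text_emb)
action = LatentConditionedPolicy()(obs, latent)
print(action.shape)                        # torch.Size([1, 23])
```

The key point the sketch illustrates is that no human-motion decoding or retargeting stage sits between the latent and the executable action.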

📝 Abstract
Natural language offers a natural interface for humanoid robots, but existing language-guided humanoid locomotion pipelines remain cumbersome and unreliable. They typically decode human motion, retarget it to robot morphology, and then track it with a physics-based controller. However, this multi-stage process is prone to cumulative errors, introduces high latency, and yields weak coupling between semantics and control. These limitations call for a more direct pathway from language to action, one that eliminates fragile intermediate stages. Therefore, we present RoboGhost, a retargeting-free framework that directly conditions humanoid policies on language-grounded motion latents. By bypassing explicit motion decoding and retargeting, RoboGhost enables a diffusion-based policy to denoise executable actions directly from noise, preserving semantic intent and supporting fast, reactive control. A hybrid causal transformer-diffusion motion generator further ensures long-horizon consistency while maintaining stability and diversity, yielding rich latent representations for precise humanoid behavior. Extensive experiments demonstrate that RoboGhost substantially reduces deployment latency, improves success rates and tracking accuracy, and produces smooth, semantically aligned locomotion on real humanoids. Beyond text, the framework naturally extends to other modalities such as images, audio, and music, providing a general foundation for vision-language-action humanoid systems.
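The abstract's central mechanism, a diffusion-based policy that denoises executable actions directly from noise while conditioned on a language-grounded motion latent, can be sketched with a standard DDPM-style reverse process. The noise schedule, step count, EpsNet denoiser, and all dimensions below are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of denoising an action chunk from Gaussian noise, conditioned on
# proprioception and a motion latent. Standard DDPM reverse process; all values assumed.
import torch
import torch.nn as nn

T = 50                                     # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)


class EpsNet(nn.Module):
    """Predicts the noise added to an action chunk, given obs, latent, and step t."""

    def __init__(self, act_dim: int = 23, obs_dim: int = 48, latent_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + obs_dim + latent_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, 256), nn.SiLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, a_t, obs, latent, t):
        t_emb = t.float().view(-1, 1) / T          # normalized timestep embedding
        return self.net(torch.cat([a_t, obs, latent, t_emb], dim=-1))


@torch.no_grad()
def sample_action(eps_net, obs, latent, act_dim: int = 23):
    """Reverse diffusion: start from noise and iteratively denoise to an action."""
    a = torch.randn(obs.shape[0], act_dim)
    for t in reversed(range(T)):
        eps = eps_net(a, obs, latent, torch.full((obs.shape[0],), t))
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        a = (a - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            a = a + torch.sqrt(betas[t]) * torch.randn_like(a)
    return a


action = sample_action(EpsNet(), torch.randn(1, 48), torch.randn(1, 64))
print(action.shape)  # torch.Size([1, 23])
```

Conditioning the denoiser on the motion latent at every step is what ties the sampled action chunk to the semantics of the command.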
Problem

Research questions and friction points this paper is trying to address.

Multi-stage pipelines that decode, retarget, and then track human motion accumulate errors
Indirect mappings from language to robot actions yield weak semantic–motion coupling
High deployment latency prevents fast, reactive language-guided locomotion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct humanoid control via language-grounded motion latents
Diffusion-based policy denoises actions from noise
Hybrid causal transformer-diffusion ensures long-horizon consistency
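The "hybrid causal transformer-diffusion" bullet can be read as: a causal transformer summarizes the history of motion latents into per-step context vectors, and each context vector then conditions a diffusion head such as the denoiser sketched after the abstract. Below is a rough sketch under that assumption; layer sizes and the mask construction are illustrative, not the paper's architecture.

```python
# Causal transformer over a sequence of motion latents; outputs per-step context
# vectors that a diffusion head could condition on. Sizes are assumptions.
import torch
import torch.nn as nn


class CausalMotionContext(nn.Module):
    """Causal self-attention over motion latents; one context vector per step."""

    def __init__(self, latent_dim: int = 64, d_model: int = 128, n_layers: int = 2):
        super().__init__()
        self.inp = nn.Linear(latent_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=256, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (batch, horizon, latent_dim)
        horizon = latents.shape[1]
        # Upper-triangular mask so step t only attends to steps <= t (causality).
        mask = torch.triu(torch.ones(horizon, horizon, dtype=torch.bool), diagonal=1)
        return self.encoder(self.inp(latents), mask=mask)


ctx = CausalMotionContext()(torch.randn(1, 16, 64))
print(ctx.shape)  # torch.Size([1, 16, 128]) -- one conditioning vector per step
```

Restricting attention to past steps lets such a generator run online while still conditioning each step on the full motion history, which is the usual route to long-horizon consistency in causal generators.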
🔎 Similar Papers
No similar papers found.
👥 Authors
Zhe Li (University of Sydney)
Cheng Chi (Columbia University, Stanford University) · robotics
Yangyang Wei (Harbin Institute of Technology)
Boan Zhu (Hong Kong University of Science and Technology)
Yibo Peng (Carnegie Mellon University) · Code Generation, Multimodal NLP, AI Agents
Tao Huang (Shanghai Jiao Tong University)
Pengwei Wang (University of Calgary) · Computer Science Security
Zhongyuan Wang (BAAI)
Shanghang Zhang (Peking University) · Embodied AI, Foundation Models
Chang Xu (University of Sydney)