AI Summary
Existing humanoid motion trackers struggle to generalize zero-shot to unseen motions and tasks, both because training data is scarce and because agility trades off against generalization. This work proposes a GPT-style Transformer built on causal attention, bringing large-scale pretraining to humanoid control for the first time. By combining multiple motion capture datasets with in-house recordings, the authors construct a unified retargeted motion corpus of two billion frames for training. Scaling both the data and the model in this way significantly outperforms current methods on highly dynamic and complex motion tracking tasks and yields strong zero-shot transfer.
Abstract
We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier in both tracking quality and zero-shot generalization.
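To make "causal attention" concrete, here is a minimal single-head sketch in plain Python. It is not the Humanoid-GPT implementation: the function name `causal_attention`, the toy two-feature frames, and the omission of learned projections, multiple heads, and batching are all assumptions made for illustration. It shows only the masking rule that GPT-style models share: the output at frame t may depend on frames 0..t and never on later frames.

```python
import math


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of T vectors (lists of floats), one per frame.
    Output frame t is a weighted average of values[0..t] only.
    """
    dim = len(queries[0])
    num_frames = len(queries)
    outputs = []
    attention_rows = []
    for t, q in enumerate(queries):
        # Causal mask: score only against frames at or before t.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible frames.
        peak = max(scores)
        exps = [math.exp(score - peak) for score in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        out = [
            sum(w * values[s][d] for s, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
        # Future (masked) positions get zero weight.
        attention_rows.append(weights + [0.0] * (num_frames - t - 1))
    return outputs, attention_rows


# Toy "motion" sequence: 4 frames with 2 features each (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, attn = causal_attention(frames, frames, frames)

for t, row in enumerate(attn):
    print(t, [round(w, 3) for w in row])
```

Each printed row is the attention distribution for one frame. Entries to the right of the diagonal are zero, so changing a later frame leaves every earlier output unchanged, which is what allows such a model to be run frame by frame at control time.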