🤖 AI Summary
Existing text-to-motion generation methods rely on end-to-end mapping, resulting in shallow semantic understanding, weak logical reasoning, poor action controllability, insufficient long-horizon consistency, and limited motion diversity. To address these limitations, the authors propose Motion-R1, a unified motion-language modeling framework that integrates Chain-of-Thought (CoT) reasoning with reinforcement learning. The approach explicitly decomposes natural language instructions into structured action paths and adopts Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm designed for large models, to jointly optimize CoT-based reasoning chain generation and motion synthesis using motion quality feedback. Leveraging large language models, the method performs multi-step semantic decomposition and action path planning. Evaluated on multiple benchmarks, it achieves competitive or superior performance relative to the state of the art, notably improving long-horizon coherence, instruction fidelity, and motion diversity. The code, model, and data will be publicly released.
📝 Abstract
Recent advances in large language models, especially in natural language understanding and reasoning, have opened new possibilities for text-to-motion generation. Although existing approaches have made notable progress in semantic alignment and motion synthesis, they often rely on end-to-end mapping strategies that fail to capture deep linguistic structures and logical reasoning. Consequently, generated motions tend to lack controllability, consistency, and diversity. To address these limitations, we propose Motion-R1, a unified motion-language modeling framework that integrates a Chain-of-Thought mechanism. By explicitly decomposing complex textual instructions into logically structured action paths, Motion-R1 provides high-level semantic guidance for motion generation, significantly enhancing the model's ability to interpret and execute multi-step, long-horizon, and compositionally rich commands. To train our model, we adopt Group Relative Policy Optimization, a reinforcement learning algorithm designed for large models, which leverages motion quality feedback to jointly optimize reasoning-chain generation and motion synthesis. Extensive experiments across multiple benchmark datasets demonstrate that Motion-R1 achieves competitive or superior performance compared to state-of-the-art methods, particularly in scenarios requiring nuanced semantic understanding and long-term temporal coherence. The code, model, and data will be publicly available.
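As a rough illustration of the training signal the abstract describes: GRPO, as described in the general RL literature (not detailed in this abstract), replaces a learned value baseline with a group-relative one. For each instruction, a group of candidate outputs is sampled, each is scored (here, by a motion-quality reward), and each sample's advantage is its reward normalized against the group's mean and standard deviation. The sketch below shows only that advantage computation; the reward function and policy update are assumptions, not specified by the paper's abstract.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled output's reward
    against the mean and std of its own sampling group, so no separate
    value network (critic) is needed as a baseline."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:  # uniform rewards carry no ranking signal
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Hypothetical motion-quality rewards for 4 motions sampled
# from the same text instruction:
advs = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
```

Samples scoring above their group's mean get positive advantages (their reasoning chains and motions are reinforced); below-mean samples are penalized, which is how reasoning and synthesis are optimized jointly from a single scalar feedback.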