🤖 AI Summary
This work addresses the challenge in human motion generation where increasing the number of discrete tokens improves reconstruction quality but substantially raises the learning difficulty for generative models. To this end, the authors propose Language-Guided Tokenization (LG-Tok), a novel approach that introduces natural language into the motion tokenization stage for the first time. Leveraging a Transformer architecture, LG-Tok achieves global semantic alignment to produce compact yet semantically rich discrete representations. A language dropout training strategy is further designed to support both conditional and unconditional generation. Experiments demonstrate that LG-Tok achieves state-of-the-art performance on HumanML3D and Motion-X, with Top-1 accuracy scores of 0.542 and 0.582 and FID scores of 0.057 and 0.088, respectively. Notably, its lightweight variant, LG-Tok-mini, maintains competitive performance using only half the number of tokens.
📝 Abstract
In this paper, we focus on motion discrete tokenization, which converts raw motion into compact discrete tokens, a process proven crucial for efficient motion generation. In this paradigm, increasing the number of tokens is a common way to improve motion reconstruction quality, but more tokens also make the distribution harder for generative models to learn. To maintain high reconstruction quality while reducing generation complexity, we propose leveraging language to achieve efficient motion tokenization, which we term Language-Guided Tokenization (LG-Tok). LG-Tok aligns natural language with motion at the tokenization stage, yielding compact, high-level semantic representations. This approach not only strengthens both tokenization and detokenization but also simplifies the learning of generative models. Furthermore, existing tokenizers predominantly adopt convolutional architectures, whose local receptive fields struggle to support global language guidance. To this end, we propose a Transformer-based Tokenizer that leverages attention mechanisms to enable effective alignment between language and motion. Additionally, we design a language-drop scheme, in which language conditions are randomly removed during training, enabling the detokenizer to support language-free generation. On the HumanML3D and Motion-X generation benchmarks, LG-Tok achieves Top-1 scores of 0.542 and 0.582, outperforming the state-of-the-art MARDM (0.500 and 0.528), and FID scores of 0.057 and 0.088 versus MARDM's 0.114 and 0.147. LG-Tok-mini uses only half the tokens while maintaining competitive performance (Top-1: 0.521/0.588, FID: 0.085/0.071), validating the efficiency of our semantic representations.
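The paper does not include code, but the language-drop scheme described above resembles classifier-free condition dropout: during training, each sample's language embedding is replaced with a learned null embedding with some probability, so the detokenizer also learns to decode without language. A minimal NumPy sketch of that idea, with all names and shapes hypothetical, might look like:

```python
import numpy as np

def language_drop(text_emb, null_emb, drop_prob, rng):
    """Randomly replace per-sample language embeddings with a null embedding.

    text_emb: (batch, dim) language features for each motion sequence.
    null_emb: (dim,) learned placeholder standing in for "no language".
    With probability drop_prob, a sample's condition is dropped, so the
    detokenizer is trained for both conditional and language-free decoding.
    """
    drop = rng.random(text_emb.shape[0]) < drop_prob        # (batch,) mask
    out = np.where(drop[:, None], null_emb[None, :], text_emb)
    return out, drop

# Toy usage: 4 samples, 8-dim embeddings, 50% drop probability.
rng = np.random.default_rng(0)
text = np.ones((4, 8))
null = np.zeros(8)
out, drop = language_drop(text, null, 0.5, rng)
# Dropped rows equal the null embedding; kept rows are unchanged.
```

At generation time, passing `null_emb` for every sample would then yield the language-free guidance mode; how the actual method injects the null condition is an assumption here.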