🤖 AI Summary
This paper addresses key challenges in human motion modeling—namely, difficulty in capturing semantic structure, poor generalization across domains, and limited adaptability to downstream tasks—and introduces MoFM, the first general-purpose foundation model for human motion. Methodologically, it proposes two novel components: (1) MotionBook, a learnable motion vocabulary that discretizes continuous motion into scalable semantic units; and (2) Thermal Cubes, a spatio-temporal thermal-map encoding scheme. These are integrated with a discrete variational autoencoder and a joint spatio-temporal modeling architecture to enable efficient large-scale pretraining. Contributions include: (1) the first motion foundation model supporting diverse downstream paradigms, including one-shot, unsupervised, and supervised learning; and (2) substantial improvements in cross-domain generalization and adaptation for action recognition, generation, and understanding, achieving state-of-the-art performance on multiple benchmarks.
📝 Abstract
Foundation Models (FMs) have increasingly drawn the attention of researchers due to their scalability and generalization across diverse tasks. Inspired by the success of FMs and the principles that have driven advancements in Large Language Models (LLMs), we introduce MoFM as a novel Motion Foundation Model. MoFM is designed for the semantic understanding of complex human motions in both time and space. To facilitate large-scale training, MotionBook, a comprehensive dictionary of discretized human motions, is designed and employed. MotionBook utilizes Thermal Cubes to capture spatio-temporal motion heatmaps, applying principles from discrete variational models to encode human movements into discrete units for a more efficient and scalable representation. MoFM, trained on a large corpus of motion data, provides a foundational backbone adaptable to diverse downstream tasks, supporting paradigms such as one-shot, unsupervised, and supervised learning. This versatility makes MoFM well-suited for a wide range of motion-based applications.
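The abstract's core mechanism, encoding continuous motion into discrete units drawn from a learned vocabulary, follows the general vector-quantization idea used by discrete variational models. The paper's actual MotionBook and Thermal Cube designs are not specified here, so the sketch below is purely illustrative: it shows nearest-neighbor lookup of continuous motion features (e.g. flattened spatio-temporal heatmap patches) against a hypothetical codebook, yielding discrete "motion unit" token ids.

```python
import numpy as np

# Illustrative sketch only: MotionBook's real architecture is not given in the
# abstract. This demonstrates the generic VQ-style step of snapping continuous
# motion features to their nearest entries in a learned discrete vocabulary.

def quantize_motion(features, codebook):
    """Map each continuous feature vector to its nearest codebook entry.

    features: (N, D) array of per-cube motion features.
    codebook: (K, D) array of learned motion-vocabulary entries (hypothetical).
    Returns (indices, quantized): discrete token ids and their embeddings.
    """
    # Squared Euclidean distance between every feature and every code.
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d.argmin(axis=1)           # discrete motion-unit id per feature
    return indices, codebook[indices]    # ids and the quantized vectors

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))       # toy vocabulary of 8 motion units
# Perturb three known codes slightly; quantization should recover their ids.
features = codebook[[2, 5, 2]] + 0.01 * rng.normal(size=(3, 4))
ids, quantized = quantize_motion(features, codebook)
print(ids)
```

In a full discrete variational autoencoder, the codebook would be learned jointly with an encoder and decoder (with a straight-through or Gumbel-softmax estimator for the non-differentiable argmin); the resulting token sequences are what make LLM-style large-scale pretraining over motion tractable.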