Human Motion Video Generation: A Survey

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing surveys predominantly focus on isolated methodologies and lack a systematic examination of the end-to-end pipeline for human motion video generation. To address this gap, we propose the first unified framework encompassing five core stages (input specification, motion planning, generation, optimization, and output rendering) and supporting over ten sub-tasks driven by visual, textual, and audio modalities. We systematically review more than 200 papers and construct the field's first comprehensive technical taxonomy. We also investigate the potential of large language models (LLMs) for motion semantic modeling and cross-modal alignment, and we survey state-of-the-art techniques, including diffusion models, generative adversarial networks (GANs), and multimodal fusion, to identify key breakthroughs and release an open-source model library. This study fills a critical gap in holistic, cross-cutting research on human motion video generation, providing both theoretical foundations and practical guidelines for applications such as digital avatars.
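
To make the taxonomy concrete, here is a minimal Python sketch of how the five-stage pipeline and its three driving modalities described above could be modeled. It is purely illustrative, based only on the summary; the identifiers (Stage, Modality, SubTask) are hypothetical and not taken from the paper or its repository.

from dataclasses import dataclass
from enum import Enum, auto

class Stage(Enum):
    # The survey's five core pipeline stages, in order.
    INPUT_SPECIFICATION = auto()
    MOTION_PLANNING = auto()
    GENERATION = auto()
    OPTIMIZATION = auto()
    OUTPUT_RENDERING = auto()

class Modality(Enum):
    # The three driving modalities covered by the survey.
    VISION = auto()
    TEXT = auto()
    AUDIO = auto()

@dataclass(frozen=True)
class SubTask:
    name: str                     # e.g. "talking head" or "music-driven dance"
    driving: Modality             # modality that conditions the motion
    stages: tuple = tuple(Stage)  # most sub-tasks traverse all five stages

# Example: an audio-driven dancing-avatar sub-task within the survey's scope.
dance = SubTask(name="music-driven dance", driving=Modality.AUDIO)
print(" -> ".join(s.name for s in dance.stages))
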

📝 Abstract
Human motion video generation has garnered significant research interest due to its broad applications, enabling innovations such as photorealistic singing heads or dynamic avatars that seamlessly dance to music. However, existing surveys in this field focus on individual methods, lacking a comprehensive overview of the entire generative process. This paper addresses this gap by providing an in-depth survey of human motion video generation, encompassing over ten sub-tasks, and detailing the five key phases of the generation process: input, motion planning, motion video generation, refinement, and output. Notably, this is the first survey that discusses the potential of large language models in enhancing human motion video generation. Our survey reviews the latest developments and technological trends in human motion video generation across three primary modalities: vision, text, and audio. By covering over two hundred papers, we offer a thorough overview of the field and highlight milestone works that have driven significant technological breakthroughs. Our goal for this survey is to unveil the prospects of human motion video generation and serve as a valuable resource for advancing the comprehensive applications of digital humans. A complete list of the models examined in this survey is available in our repository: https://github.com/Winn1y/Awesome-Human-Motion-Video-Generation.
Problem

Research questions and friction points this paper is trying to address.

Lack of a comprehensive overview of human motion video generation
Fragmented treatment of the more than ten sub-tasks and five key phases of the generation process
Unexplored potential of large language models for motion video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive survey covering over ten sub-tasks across the full generation process
First survey to discuss the potential applications of large language models
Reviews the latest developments across the vision, text, and audio modalities
👥 Authors
Haiwei Xue, Tsinghua University
Xiangyang Luo, Tsinghua University
Zhanghao Hu, School of Informatics, King's College London
Xin Zhang, School of Mathematics and Statistics, Xi’an Jiaotong University
Xunzhi Xiang, Nanjing University
Yuqin Dai, Tsinghua University
Jianzhuang Liu, Shenzhen Institutes of Advanced Technology, University of Chinese Academy of Sciences
Zhensong Zhang, Huawei Noah’s Ark Lab
Minglei Li, 01.AI
Jian Yang, PCALab, Nanjing University of Science and Technology
Fei Ma, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
Zhiyong Wu, Tsinghua University
Changpeng Yang, 01.AI
Zonghong Dai, Artificial Intelligence Innovation and Incubation (AI) Institute of Fudan University
Fei Richard Yu, Shenzhen University and Carleton University