Learning Human Skill Generators at Key-Step Levels

📅 2025-02-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video generation models struggle to model complex human skills due to their multi-step, long-horizon, and dynamically shifting scene characteristics, which undermine autoregressive modeling. To address this, we propose Key-Step Skill Generation (KS-Gen): given an initial state and a skill description, generate only short video clips capturing *core procedural steps*, rather than full-length videos. Our contributions are threefold: (1) We formally define the KS-Gen task paradigm; (2) We construct the first high-quality, human-annotated dataset specifically targeting skill-critical steps; (3) We introduce a retrieval-augmented, three-stage collaborative framework integrating a multimodal large language model (MLLM), a key-step image generator (KIG), and a temporally consistent video generator. Evaluated on our curated benchmark, our method achieves state-of-the-art performance in both semantic fidelity and temporal coherence. All models and data are publicly released.

📝 Abstract
We are committed to learning human skill generators at key-step levels. Skill generation is a challenging endeavor, but its successful implementation could greatly facilitate human skill learning and provide more experience for embodied intelligence. Although current video generation models can synthesize simple and atomic human operations, they struggle with human skills due to their complex procedural structure. Human skills involve multi-step, long-duration actions and complex scene transitions, so existing naive auto-regressive methods for synthesizing long videos cannot generate them. To address this, we propose a novel task, Key-step Skill Generation (KS-Gen), aimed at reducing the complexity of generating human skill videos. Given an initial state and a skill description, the task is to generate video clips of the key steps needed to complete the skill, rather than a full-length video. To support this task, we introduce a carefully curated dataset and define multiple evaluation metrics to assess performance. Considering the complexity of KS-Gen, we propose a new framework for the task. First, a multimodal large language model (MLLM) generates descriptions for key steps using retrieval augmentation. Subsequently, a Key-step Image Generator (KIG) addresses the visual discontinuity between key steps in skill videos. Finally, a video generation model uses these descriptions and key-step images to generate video clips of the key steps with high temporal consistency. We offer a detailed analysis of the results, hoping to provide more insights into human skill generation. All models and data are available at https://github.com/MCG-NJU/KS-Gen.
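The three-stage framework described in the abstract can be sketched as a simple pipeline. All names and interfaces below are hypothetical illustrations of the data flow (MLLM planning with retrieval, key-step image generation, per-step clip generation), not the authors' actual API; toy stand-in functions are used in place of real models.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class KeyStep:
    description: str  # step text proposed by the MLLM
    image: Any        # key-step image from the KIG
    clip: Any         # short video clip animating this step


def ks_gen(initial_frame: Any,
           skill: str,
           plan_steps: Callable[[Any, str], List[str]],
           gen_image: Callable[[Any, str], Any],
           gen_clip: Callable[[Any, str], Any]) -> List[KeyStep]:
    """Hypothetical sketch of the three-stage KS-Gen pipeline."""
    # Stage 1: an MLLM (with retrieval augmentation) proposes key-step texts.
    step_texts = plan_steps(initial_frame, skill)
    steps, prev_image = [], initial_frame
    for text in step_texts:
        # Stage 2: the Key-step Image Generator bridges the visual
        # discontinuity between consecutive key steps.
        image = gen_image(prev_image, text)
        # Stage 3: a video model animates each key-step image into a clip.
        steps.append(KeyStep(text, image, gen_clip(image, text)))
        prev_image = image
    return steps


# Toy stand-ins, just to illustrate how data flows through the stages:
steps = ks_gen(
    "frame0", "make coffee",
    plan_steps=lambda frame, s: ["grind beans", "brew", "pour"],
    gen_image=lambda prev, text: f"img({text})",
    gen_clip=lambda img, text: f"clip({text})",
)
print([s.description for s in steps])  # prints ['grind beans', 'brew', 'pour']
```

The key design point this sketch captures is that each stage conditions on the previous key-step image rather than on a full video history, which is how KS-Gen sidesteps long-horizon autoregressive drift.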
Problem

Research questions and friction points this paper is trying to address.

Generate human skill videos
Reduce generation complexity
Enhance temporal consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal large language model
Key-step Image Generator
High temporal consistency
👥 Authors
Yilu Wu
Nanjing University
Computer Vision
Chenhui Zhu
Lawrence Berkeley National Lab
Soft and Functional Materials, Synchrotron X-Ray Science
Shuai Wang
State Key Laboratory for Novel Software Technology, Nanjing University
Hanlin Wang
HKUST
Computer Vision, Video Understanding
Jing Wang
State Key Laboratory for Novel Software Technology, Nanjing University
Zhaoxiang Zhang
Institute of Automation, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Biologically-inspired Learning
Limin Wang
State Key Laboratory for Novel Software Technology, Nanjing University; Shanghai AI Lab