SkyReels-V2: Infinite-length Film Generative Model

📅 2025-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current video generation models face fundamental trade-offs among prompt adherence, visual fidelity, motion dynamics, and maximum generation duration: improving image quality often compromises camera-motion naturalness, while temporal extent remains capped at 5–10 seconds; moreover, general-purpose multimodal large language models (MLLMs) lack the capacity to interpret cinematic grammar, including shot composition, actor micro-expressions, and camera motion. To address these limitations, we propose a structured video representation framework that synergistically integrates MLLMs with specialized sub-expert shot models. We introduce Diffusion Forcing training with a non-decreasing noise schedule to enable efficient, arbitrarily long video synthesis. Through progressive-resolution pretraining and a four-stage post-training pipeline, including supervised fine-tuning and reinforcement learning from human feedback, we significantly enhance long-horizon temporal consistency and cinematographic realism. Our approach supports high-fidelity, film-grade video generation exceeding 100 seconds. Code and models are publicly released.
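
To make the noise-schedule idea concrete, here is a minimal Python sketch of per-frame noise assignment in the Diffusion Forcing style: each frame receives its own noise level, constrained to be non-decreasing along the temporal axis so earlier frames stay cleaner than later ones. The function names, sigma schedule, and tensor shapes are illustrative assumptions, not the paper's implementation.

```python
import torch

def sample_frame_noise_levels(num_frames: int, num_steps: int = 50) -> torch.Tensor:
    """Assign one noise level per frame such that levels never decrease
    along the temporal axis: earlier frames cleaner, later frames noisier.
    Sorting i.i.d. draws is one simple way to enforce this (an assumption,
    not necessarily the paper's sampling rule)."""
    levels = torch.randint(0, num_steps, (num_frames,))
    return torch.sort(levels).values  # non-decreasing in time

def add_per_frame_noise(frames: torch.Tensor, levels: torch.Tensor,
                        sigmas: torch.Tensor) -> torch.Tensor:
    """frames: (T, C, H, W) clean latents; sigmas: (num_steps,) noise scale
    per diffusion step. Each frame is corrupted at its own level."""
    noise = torch.randn_like(frames)
    sigma_t = sigmas[levels].view(-1, 1, 1, 1)  # broadcast over C, H, W
    return frames + sigma_t * noise

# Example: an 8-frame clip where frame 0 is nearly clean and frame 7 is
# heavily noised, as in one diffusion-forcing training batch.
T, steps = 8, 50
sigmas = torch.linspace(0.01, 1.0, steps)   # toy sigma schedule
frames = torch.randn(T, 3, 32, 32)          # stand-in for video latents
levels = sample_frame_noise_levels(T, steps)
noisy = add_per_frame_noise(frames, levels, sigmas)
print(levels)  # e.g. tensor([ 3,  9, 14, 22, 30, 38, 41, 47])
```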

📝 Abstract
Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5–10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address them, we propose SkyReels-V2, an infinite-length film generative model that synergizes a Multi-modal Large Language Model (MLLM), multi-stage pretraining, reinforcement learning, and a Diffusion Forcing framework. First, we design a comprehensive structural representation of video that combines general descriptions from the MLLM with detailed shot language from sub-expert models. Aided by human annotation, we then train a unified video captioner, SkyCaptioner-V1, to efficiently label the video data. Second, we establish progressive-resolution pretraining for fundamental video generation, followed by a four-stage post-training enhancement: initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; our Diffusion Forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; and a final high-quality SFT stage refines visual fidelity. All code and models are available at https://github.com/SkyworkAI/SkyReels-V2.
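
As a rough illustration of the structural representation, the sketch below combines a general MLLM description with shot-language fields of the kind SkyCaptioner-V1 is trained to produce. The field names and serialization are assumptions for exposition; the paper's actual schema is not reproduced here.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class StructuredCaption:
    """Structured video representation in the spirit of SkyCaptioner-V1:
    a general MLLM description plus shot-language fields filled by
    specialized sub-expert models. Field names are illustrative."""
    description: str    # general-purpose MLLM output
    shot_type: str      # e.g. "close-up", "wide shot"
    camera_motion: str  # e.g. "slow dolly-in"
    expression: str     # actor expression / micro-expression

def to_training_caption(cap: StructuredCaption) -> str:
    """Serialize the structured fields into a single caption string
    usable as a text-to-video conditioning prompt."""
    return json.dumps(asdict(cap), ensure_ascii=False)

cap = StructuredCaption(
    description="A woman walks along a rainy street at night.",
    shot_type="medium shot",
    camera_motion="handheld tracking",
    expression="pensive",
)
print(to_training_caption(cap))
```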
Problem

Research questions and friction points this paper is trying to address.

Improving motion dynamics and visual quality in video generation
Extending video duration beyond the current 5–10 second cap
Enhancing shot-aware generation with cinematic grammar understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal LLM and sub-expert models for video representation
Progressive-resolution pretraining and four-stage post-training enhancement
Diffusion Forcing framework with non-decreasing noise schedules, enabling long-video synthesis (see the sketch after this list)
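
The sketch below illustrates, under stated assumptions, how a diffusion-forcing-style denoiser can extend a clip autoregressively: a sliding window keeps a few already-generated frames as clean context while new frames are denoised, so generation length is not capped by the training window. `extend_video`, the window and overlap sizes, and the stand-in denoiser are hypothetical, not the released model's API.

```python
import torch

def extend_video(denoise_window, first_clip: torch.Tensor,
                 total_frames: int, window: int = 16, overlap: int = 8):
    """Autoregressive long-video rollout: each step re-denoises a window
    whose first `overlap` frames are already-generated (clean) context,
    so the clip can be extended indefinitely. `denoise_window` is any
    model call mapping (context, num_new_frames) -> new clean frames."""
    video = first_clip  # (T0, C, H, W)
    while video.shape[0] < total_frames:
        context = video[-overlap:]                       # clean history
        new = denoise_window(context, window - overlap)  # noisy -> clean
        video = torch.cat([video, new], dim=0)
    return video[:total_frames]

# Toy stand-in for the denoiser: returns random "frames" of the right shape.
def fake_denoiser(context, n_new):
    return torch.randn(n_new, *context.shape[1:])

clip = torch.randn(16, 3, 32, 32)
long_video = extend_video(fake_denoiser, clip, total_frames=64)
print(long_video.shape)  # torch.Size([64, 3, 32, 32])
```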
👥 Authors
Guibin Chen (Skywork AI), Dixuan Lin, Jiangping Yang, Chunze Lin (SenseTime Research), Juncheng Zhu, Mingyuan Fan (Kunlun Inc.), Hao Zhang, Sheng Chen, Zheng Chen, Chengchen Ma, Weiming Xiong, Wei Wang, Kang Kang, Zhiheng Xu (University at Buffalo), Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li (NLPR), Zhengcong Fei (ICT, UCAS), Yang Li, Yahui Zhou