SkyReels-V2: Infinite-length Film Generative Model

📅 2025-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current video generation models face fundamental trade-offs among prompt adherence, visual fidelity, motion dynamics, and maximum generation duration: improving image quality often compromises camera-motion naturalness, while temporal extent remains capped at 5–10 seconds; moreover, general-purpose multimodal large language models (MLLMs) lack the capacity to interpret cinematic grammar, including shot composition, actor micro-expressions, and camera motion. To address these limitations, we propose a structured video representation framework that synergistically integrates MLLMs with specialized sub-expert shot models. We introduce Diffusion Forcing training with a non-decreasing noise schedule to enable efficient, arbitrarily long video synthesis. Through progressive-resolution pretraining and a four-stage post-training pipeline, including supervised fine-tuning and reinforcement learning from human feedback, we significantly enhance long-horizon temporal consistency and cinematographic realism. Our approach supports high-fidelity, film-grade video generation exceeding 100 seconds. Code and models are publicly released.
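
To make the noise-schedule idea concrete, here is a minimal Python sketch of per-frame noise assignment in the Diffusion Forcing style: each frame receives its own noise level, constrained to be non-decreasing along the temporal axis so earlier frames stay cleaner than later ones. The function names, sigma schedule, and tensor shapes are illustrative assumptions, not the paper's implementation.

```python
import torch

def sample_frame_noise_levels(num_frames: int, num_steps: int = 50) -> torch.Tensor:
    """Assign one noise level per frame such that levels never decrease
    along the temporal axis: earlier frames cleaner, later frames noisier.
    Sorting i.i.d. draws is one simple way to enforce this (an assumption,
    not necessarily the paper's sampling rule)."""
    levels = torch.randint(0, num_steps, (num_frames,))
    return torch.sort(levels).values  # non-decreasing in time

def add_per_frame_noise(frames: torch.Tensor, levels: torch.Tensor,
                        sigmas: torch.Tensor) -> torch.Tensor:
    """frames: (T, C, H, W) clean latents; sigmas: (num_steps,) noise scale
    per diffusion step. Each frame is corrupted at its own level."""
    noise = torch.randn_like(frames)
    sigma_t = sigmas[levels].view(-1, 1, 1, 1)  # broadcast over C, H, W
    return frames + sigma_t * noise

# Example: an 8-frame clip where frame 0 is nearly clean and frame 7 is
# heavily noised, as in one diffusion-forcing training batch.
T, steps = 8, 50
sigmas = torch.linspace(0.01, 1.0, steps)   # toy sigma schedule
frames = torch.randn(T, 3, 32, 32)          # stand-in for video latents
levels = sample_frame_noise_levels(T, steps)
noisy = add_per_frame_noise(frames, levels, sigmas)
print(levels)  # e.g. tensor([ 3,  9, 14, 22, 30, 38, 41, 47])
```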

📝 Abstract
Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5–10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address them, we propose SkyReels-V2, an infinite-length film generative model that synergizes a Multi-modal Large Language Model (MLLM), multi-stage pretraining, reinforcement learning, and a Diffusion Forcing framework. First, we design a comprehensive structural representation of video that combines general descriptions from the MLLM with detailed shot language from sub-expert models. Aided by human annotation, we then train a unified video captioner, SkyCaptioner-V1, to efficiently label the video data. Second, we establish progressive-resolution pretraining for fundamental video generation, followed by a four-stage post-training enhancement: initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; our Diffusion Forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; and a final high-quality SFT stage refines visual fidelity. All code and models are available at https://github.com/SkyworkAI/SkyReels-V2.
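
As a rough illustration of the structural representation, the sketch below combines a general MLLM description with shot-language fields of the kind SkyCaptioner-V1 is trained to produce. The field names and serialization are assumptions for exposition; the paper's actual schema is not reproduced here.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class StructuredCaption:
    """Structured video representation in the spirit of SkyCaptioner-V1:
    a general MLLM description plus shot-language fields filled by
    specialized sub-expert models. Field names are illustrative."""
    description: str    # general-purpose MLLM output
    shot_type: str      # e.g. "close-up", "wide shot"
    camera_motion: str  # e.g. "slow dolly-in"
    expression: str     # actor expression / micro-expression

def to_training_caption(cap: StructuredCaption) -> str:
    """Serialize the structured fields into a single caption string
    usable as a text-to-video conditioning prompt."""
    return json.dumps(asdict(cap), ensure_ascii=False)

cap = StructuredCaption(
    description="A woman walks along a rainy street at night.",
    shot_type="medium shot",
    camera_motion="handheld tracking",
    expression="pensive",
)
print(to_training_caption(cap))
```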
Problem

Research questions and friction points this paper is trying to address.

Improving motion dynamics and visual quality in video generation
Extending video duration beyond the current 5–10 second cap
Enhancing shot-aware generation with cinematic grammar understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal LLM and sub-expert models for video representation
Progressive-resolution pretraining and four-stage post-training enhancement
Diffusion Forcing framework with non-decreasing noise schedules, enabling long-video synthesis (see the sketch after this list)
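
The sketch below illustrates, under stated assumptions, how a diffusion-forcing-style denoiser can extend a clip autoregressively: a sliding window keeps a few already-generated frames as clean context while new frames are denoised, so generation length is not capped by the training window. `extend_video`, the window and overlap sizes, and the stand-in denoiser are hypothetical, not the released model's API.

```python
import torch

def extend_video(denoise_window, first_clip: torch.Tensor,
                 total_frames: int, window: int = 16, overlap: int = 8):
    """Autoregressive long-video rollout: each step re-denoises a window
    whose first `overlap` frames are already-generated (clean) context,
    so the clip can be extended indefinitely. `denoise_window` is any
    model call mapping (context, num_new_frames) -> new clean frames."""
    video = first_clip  # (T0, C, H, W)
    while video.shape[0] < total_frames:
        context = video[-overlap:]                       # clean history
        new = denoise_window(context, window - overlap)  # noisy -> clean
        video = torch.cat([video, new], dim=0)
    return video[:total_frames]

# Toy stand-in for the denoiser: returns random "frames" of the right shape.
def fake_denoiser(context, n_new):
    return torch.randn(n_new, *context.shape[1:])

clip = torch.randn(16, 3, 32, 32)
long_video = extend_video(fake_denoiser, clip, total_frames=64)
print(long_video.shape)  # torch.Size([64, 3, 32, 32])
```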
👥 Authors
Guibin Chen (Skywork AI), Dixuan Lin, Jiangping Yang, Chunze Lin (SenseTime Research), Juncheng Zhu, Mingyuan Fan (Kunlun Inc.), Hao Zhang, Sheng Chen, Zheng Chen, Chengchen Ma, Weiming Xiong, Wei Wang, Kang Kang, Zhiheng Xu (University at Buffalo), Yuzhe Jin, Yupeng Liang, Yubing Song, Peng Zhao, Boyuan Xu, Di Qiu, Debang Li (NLPR), Zhengcong Fei (ICT, UCAS), Yang Li, Yahui Zhou