🤖 AI Summary
Existing unified video models primarily focus on conventional tasks such as text-to-video generation, with little systematic investigation of temporal and causal reasoning capabilities. Method: We introduce Next Scene Prediction (NSP), a novel task requiring models to infer and generate semantically coherent, causally plausible future scenes from preceding video segments. We formally define NSP for the first time; propose a cross-modal latent query bridging framework that jointly optimizes the understanding (Qwen-VL) and generation (LTX) modules; and incorporate a causal consistency reward within a three-stage training paradigm of pre-training, supervised fine-tuning, and GRPO-based reinforcement learning. Contribution/Results: Evaluated on our newly constructed large-scale NSP benchmark, our approach achieves state-of-the-art performance, significantly advancing temporal modeling and causal reasoning for future event prediction.
📝 Abstract
Recent unified models for joint understanding and generation have significantly advanced visual generation capabilities. However, their focus on conventional tasks like text-to-video generation has left the temporal reasoning potential of unified models largely underexplored. To address this gap, we introduce Next Scene Prediction (NSP), a new task that pushes unified video models toward temporal and causal reasoning. Unlike text-to-video generation, NSP requires predicting plausible futures from preceding context, demanding deeper understanding and reasoning. To tackle this task, we propose a unified framework combining Qwen-VL for comprehension and LTX for synthesis, bridged by a latent query embedding and a connector module. This model is trained in three stages on our newly curated, large-scale NSP dataset: text-to-video pre-training, supervised fine-tuning, and reinforcement learning (via GRPO) with our proposed causal consistency reward. Experiments demonstrate our model achieves state-of-the-art performance on our benchmark, advancing the capability of generalist multimodal systems to anticipate what happens next.
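The bridging idea described above (learnable latent queries that distill the understanding model's hidden states into conditioning for the generator) can be sketched as follows. This is an illustrative guess, not the paper's implementation: the class name `LatentQueryConnector`, the dimensions, the use of a single cross-attention layer, and the linear projection are all assumptions; the actual connector architecture and the Qwen-VL/LTX hidden sizes are not specified in the abstract.

```python
import torch
import torch.nn as nn

class LatentQueryConnector(nn.Module):
    """Hypothetical sketch of a cross-modal bridge: learnable latent
    query tokens cross-attend to the understanding model's hidden
    states (e.g. Qwen-VL) and are projected into the generator's
    (e.g. LTX) conditioning space. All names, dimensions, and layer
    choices here are illustrative assumptions."""

    def __init__(self, num_queries: int = 64, und_dim: int = 3584,
                 gen_dim: int = 2048, heads: int = 8):
        super().__init__()
        # Learnable query embeddings that will pull information
        # out of the understanding model's token sequence.
        self.queries = nn.Parameter(torch.randn(num_queries, und_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(und_dim, heads, batch_first=True)
        # Connector: map attended features into the generator's space.
        self.proj = nn.Linear(und_dim, gen_dim)

    def forward(self, und_hidden: torch.Tensor) -> torch.Tensor:
        # und_hidden: (batch, seq_len, und_dim) hidden states from the
        # understanding module over the preceding video segment.
        batch = und_hidden.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        attended, _ = self.cross_attn(q, und_hidden, und_hidden)
        # Returns (batch, num_queries, gen_dim) conditioning tokens
        # that the video generator consumes to synthesize the next scene.
        return self.proj(attended)
```

In such a design, only the queries and connector need to learn the cross-modal mapping during the pre-training stage, while the backbone understanding and generation modules can be adapted more gently in later fine-tuning and RL stages.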