What Happens Next? Next Scene Prediction with a Unified Video Model

📅 2025-12-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing unified video models primarily focus on conventional tasks such as text-to-video generation, with limited systematic investigation into temporal and causal reasoning capabilities. Method: We introduce Next Scene Prediction (NSP), a novel task requiring models to infer and generate semantically coherent, causally plausible future scenes from preceding video segments. We formally define NSP for the first time; propose a cross-modal latent query bridging framework that jointly optimizes understanding (Qwen-VL) and generation (LTX) modules; and incorporate a causal consistency reward mechanism within a three-stage training paradigm—pretraining, supervised fine-tuning, and GRPO-based reinforcement learning. Contribution/Results: Evaluated on our newly constructed large-scale NSP benchmark, our approach achieves state-of-the-art performance, significantly advancing temporal modeling and causal reasoning for future event prediction.
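The summary mentions GRPO-based reinforcement learning with a causal consistency reward. The paper's reward model is not detailed here, but the group-relative advantage computation at the core of GRPO can be sketched as follows (a minimal illustration; the function name and the scalar-reward interface are assumptions, not the authors' code):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Group-relative advantage (GRPO): for a group of rollouts sampled
    from the same prompt, normalize each rollout's scalar reward by the
    group's mean and standard deviation. Rollouts scoring above the
    group average get positive advantage, below-average get negative."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1e-8  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]
```

In this setup, a causal consistency reward would score each generated future scene (e.g. for plausibility given the preceding clip), and the advantages above would weight the policy-gradient update without requiring a learned value function.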

📝 Abstract
Recent unified models for joint understanding and generation have significantly advanced visual generation capabilities. However, their focus on conventional tasks like text-to-video generation has left the temporal reasoning potential of unified models largely underexplored. To address this gap, we introduce Next Scene Prediction (NSP), a new task that pushes unified video models toward temporal and causal reasoning. Unlike text-to-video generation, NSP requires predicting plausible futures from preceding context, demanding deeper understanding and reasoning. To tackle this task, we propose a unified framework combining Qwen-VL for comprehension and LTX for synthesis, bridged by a latent query embedding and a connector module. This model is trained in three stages on our newly curated, large-scale NSP dataset: text-to-video pre-training, supervised fine-tuning, and reinforcement learning (via GRPO) with our proposed causal consistency reward. Experiments demonstrate our model achieves state-of-the-art performance on our benchmark, advancing the capability of generalist multimodal systems to anticipate what happens next.
Problem

Research questions and friction points this paper is trying to address.

Predict plausible future scenes from preceding video context
Advance unified video models for temporal and causal reasoning
Bridge comprehension and synthesis for next scene prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework combining Qwen-VL and LTX models
Three-stage training with causal consistency reward
Latent query embedding and connector module bridging
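The bridging idea above, learnable query tokens that read out the understanding model's hidden states and a connector that projects them into the generator's conditioning space, can be sketched roughly as below. This is a hypothetical PyTorch sketch: the class name, dimensions, and layer choices are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class LatentQueryConnector(nn.Module):
    """Sketch of a latent-query bridge: learnable queries cross-attend
    to the understanding model's (e.g. Qwen-VL) hidden states, then an
    MLP connector maps them into the video generator's (e.g. LTX)
    conditioning space."""

    def __init__(self, num_queries=32, und_dim=3584, gen_dim=2048, heads=8):
        super().__init__()
        # Learnable query embeddings, shared across inputs
        self.queries = nn.Parameter(torch.randn(num_queries, und_dim) * 0.02)
        self.attn = nn.MultiheadAttention(und_dim, heads, batch_first=True)
        # Connector MLP into the generator's conditioning dimension
        self.proj = nn.Sequential(
            nn.Linear(und_dim, gen_dim),
            nn.GELU(),
            nn.Linear(gen_dim, gen_dim),
        )

    def forward(self, vlm_hidden):
        # vlm_hidden: (batch, seq_len, und_dim) hidden states from the VLM
        q = self.queries.unsqueeze(0).expand(vlm_hidden.size(0), -1, -1)
        out, _ = self.attn(q, vlm_hidden, vlm_hidden)
        return self.proj(out)  # (batch, num_queries, gen_dim)
```

The design choice this illustrates: a fixed, small set of query tokens gives the generator a bounded-length conditioning signal regardless of how long the preceding video context is.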
Xinjie Li
Pennsylvania State University, USA
Zhimin Chen
Amazon, USA
Rui Zhao
Amazon, USA
Florian Schiffers
Northwestern University
Computer Vision · Computational Photography · Medical Imaging · Machine Learning
Zhenyu Liao
Amazon, USA
Vimal Bhat
Amazon, USA