🤖 AI Summary
Existing unified video models primarily focus on conventional tasks such as text-to-video generation, with little systematic investigation of temporal and causal reasoning capabilities. Method: We introduce Next Scene Prediction (NSP), a novel task requiring models to infer and generate semantically coherent, causally plausible future scenes from preceding video segments. We formally define NSP for the first time; propose a cross-modal latent query bridging framework that jointly optimizes the understanding (Qwen-VL) and generation (LTX) modules; and incorporate a causal consistency reward within a three-stage training paradigm of pre-training, supervised fine-tuning, and GRPO-based reinforcement learning. Contribution/Results: Evaluated on our newly constructed large-scale NSP benchmark, our approach achieves state-of-the-art performance, significantly advancing temporal modeling and causal reasoning for future event prediction.
📝 Abstract
Recent unified models for joint understanding and generation have significantly advanced visual generation capabilities. However, their focus on conventional tasks like text-to-video generation has left the temporal reasoning potential of unified models largely underexplored. To address this gap, we introduce Next Scene Prediction (NSP), a new task that pushes unified video models toward temporal and causal reasoning. Unlike text-to-video generation, NSP requires predicting plausible futures from preceding context, demanding deeper understanding and reasoning. To tackle this task, we propose a unified framework combining Qwen-VL for comprehension and LTX for synthesis, bridged by a latent query embedding and a connector module. This model is trained in three stages on our newly curated, large-scale NSP dataset: text-to-video pre-training, supervised fine-tuning, and reinforcement learning (via GRPO) with our proposed causal consistency reward. Experiments demonstrate our model achieves state-of-the-art performance on our benchmark, advancing the capability of generalist multimodal systems to anticipate what happens next.
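The bridging idea described above (learnable latent queries that distill the understanding model's hidden states into conditioning for the generator) can be sketched as follows. This is an illustrative guess, not the paper's implementation: the class name `LatentQueryConnector`, the dimensions, the use of a single cross-attention layer, and the linear projection are all assumptions; the actual connector architecture and the Qwen-VL/LTX hidden sizes are not specified in the abstract.

```python
import torch
import torch.nn as nn

class LatentQueryConnector(nn.Module):
    """Hypothetical sketch of a cross-modal bridge: learnable latent
    query tokens cross-attend to the understanding model's hidden
    states (e.g. Qwen-VL) and are projected into the generator's
    (e.g. LTX) conditioning space. All names, dimensions, and layer
    choices here are illustrative assumptions."""

    def __init__(self, num_queries: int = 64, und_dim: int = 3584,
                 gen_dim: int = 2048, heads: int = 8):
        super().__init__()
        # Learnable query embeddings that will pull information
        # out of the understanding model's token sequence.
        self.queries = nn.Parameter(torch.randn(num_queries, und_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(und_dim, heads, batch_first=True)
        # Connector: map attended features into the generator's space.
        self.proj = nn.Linear(und_dim, gen_dim)

    def forward(self, und_hidden: torch.Tensor) -> torch.Tensor:
        # und_hidden: (batch, seq_len, und_dim) hidden states from the
        # understanding module over the preceding video segment.
        batch = und_hidden.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        attended, _ = self.cross_attn(q, und_hidden, und_hidden)
        # Returns (batch, num_queries, gen_dim) conditioning tokens
        # that the video generator consumes to synthesize the next scene.
        return self.proj(attended)
```

In such a design, only the queries and connector need to learn the cross-modal mapping during the pre-training stage, while the backbone understanding and generation modules can be adapted more gently in later fine-tuning and RL stages.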