🤖 AI Summary
This work addresses the hallucination and visual grounding issues in existing video event prediction methods, which arise from prematurely converting visual reasoning into text and thereby discarding fine-grained motion, geometric, and interaction cues. To overcome this limitation, the authors propose Future-L1, a framework that alternates between language tokens and continuous latent visual segments during autoregressive decoding. This interleaved latent visual reasoning mechanism preserves intermediate visual semantics so that language and vision reasoning can support each other. Training relies on a newly curated Future-L1-50K dataset and on LA-DAPO, a latent-aware reinforcement learning objective with outcome-contrastive and temporal-diversity rewards. The method improves Qwen3-VL-8B's score on FutureBench from 61.0 to 85.4, surpassing Video-CoE by 10.4 points, and raises the average score on TwiFF-Bench from 2.44 to 3.04.
📝 Abstract
Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Existing video MLLMs usually verbalize intermediate future reasoning in text space: once visual evidence is verbalized, fine-grained motion, geometry, and interaction cues can be lost, leading to plausible but visually ungrounded hallucinations. We introduce Future-L1, an interleaved latent visual reasoning framework that lets an MLLM alternate between language tokens and continuous latent visual spans during autoregressive decoding. To train this capability, we construct Future-L1-50K by selecting examples where future visual hints help prediction, align latent states to future-frame embeddings, and then further optimize sampled latent trajectories with LA-DAPO, a latent-aware RL objective with outcome-contrastive and temporal-diversity rewards. Future-L1 achieves new state-of-the-art results on both benchmarks: on FutureBench, it improves Qwen3-VL-8B from 61.0 to 85.4 and exceeds the previous best Video-CoE by 10.4 points; on TwiFF-Bench, it improves the average score from 2.44 to 3.04. These results suggest that future-oriented video reasoning benefits from preserving intermediate visual semantics in latent space rather than translating every reasoning step into text.
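The alternation between language tokens and continuous latent spans can be pictured with a small toy loop. The sketch below is only an illustration of the general idea, not the authors' implementation: the function `interleaved_decode`, the marker token `latent_open_id`, the fixed span length `latent_len`, and the toy model (`toy_step`, `toy_embed`) are all hypothetical names introduced here. It assumes that, inside a latent span, the model's hidden state is fed back as the next input instead of being projected to a vocabulary token.

```python
from typing import Callable, List, Sequence, Tuple, Union

Vector = List[float]
Step = Union[int, Vector]  # a discrete token id or a continuous latent vector


def interleaved_decode(
    step_fn: Callable[[Sequence[Vector]], Tuple[int, Vector]],
    embed: Callable[[int], Vector],
    prompt_ids: Sequence[int],
    latent_open_id: int,  # hypothetical marker token that opens a latent span
    eos_id: int,
    latent_len: int = 4,  # assumed fixed span length, chosen for illustration
    max_steps: int = 32,
) -> List[Step]:
    """Decode a sequence that mixes token ids and continuous latent vectors."""
    inputs: List[Vector] = [embed(t) for t in prompt_ids]
    trace: List[Step] = []
    latent_left = 0
    for _ in range(max_steps):
        token_id, hidden = step_fn(inputs)
        if latent_left > 0:
            # Latent mode: feed the hidden state back directly, emit no token.
            inputs.append(hidden)
            trace.append(hidden)
            latent_left -= 1
            continue
        # Text mode: emit the token and feed back its embedding.
        trace.append(token_id)
        inputs.append(embed(token_id))
        if token_id == eos_id:
            break
        if token_id == latent_open_id:
            latent_left = latent_len
    return trace


# Toy stand-ins for a real model (ids: 0 = eos, 1 = open latent span, 2 = answer).
def toy_embed(token_id: int) -> Vector:
    return [float(token_id), 0.0]


def toy_step(inputs: Sequence[Vector]) -> Tuple[int, Vector]:
    n = len(inputs)
    hidden = [sum(v[0] for v in inputs) / n, float(n)]
    token_id = {1: 1, 4: 2}.get(n, 0)  # scripted by context length
    return token_id, hidden


trace = interleaved_decode(
    toy_step, toy_embed, prompt_ids=[5], latent_open_id=1, eos_id=0, latent_len=2
)
print(trace)
```

In this toy run the trace contains the opening marker, two latent vectors, an answer token, and the end-of-sequence token, so the continuous steps sit between ordinary text tokens. A real system would replace `toy_step` with an MLLM forward pass and would need its own rule for when a latent span ends; the abstract does not specify those details.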