Towards Effective Long-Video Event Prediction via Multi-Level Event Semantics Mining

📅 2026-05-29

🤖 AI Summary
Current long-form video language models struggle to extract fine-grained event details and model dynamic temporal evolution, limiting their generalization in event prediction. To address this, this work proposes VISTA, a novel framework that introduces multi-level event semantic mining into long video forecasting for the first time. VISTA captures critical visual details through character-centric visual prompting, constructs coherent event chains via knowledge-enhanced iterative retrieval, and integrates multi-level cues through a human-like "propose-then-retrieve" mechanism to generate plausible future events. Experiments demonstrate that VISTA significantly outperforms existing methods on real-world long video datasets, achieving notable improvements in both prediction accuracy and logical coherence.
📝 Abstract
Accurately predicting future events is fundamental to content understanding and decision-making across various domains. While prior research has primarily focused on text or short-video scenarios, long-video event prediction, characterized by vast multimodal context and more complex narratives, remains underexplored. Meanwhile, although recent Long-Video Language Models (LVLMs), built on Large Language Models (LLMs) and Vision-Language Models (VLMs), have shown promise in long-video question answering and summarization, they struggle to generalize to event prediction, as they can neither precisely extract event-related details nor perform fine-grained analysis of event development. To address this gap, we propose VISTA, a multi-level event semantics mining framework for long-video event prediction. Initially, VISTA applies a character-centric visual prompt to precisely extract event-related visual details, enhancing detail-level semantics; subsequently, it employs a knowledge-enhanced iterative retrieval strategy, guiding the LLM to progressively construct logically coherent event chains, thereby improving event-level narratives; ultimately, VISTA adopts a human-like propose-then-retrieve strategy to generate diverse future-oriented proposals and integrate multi-level clues, producing robust and accurate predictions. Extensive experiments on real-world datasets validate the effectiveness of VISTA for long-video event prediction.
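The propose-then-retrieve idea described above can be illustrated with a toy sketch. This is not VISTA's implementation: the abstract gives no code-level details, so every name here (bag_of_words, cosine, retrieve, propose_then_retrieve) is hypothetical, the candidate future events are hard-coded rather than generated by an LLM, and a simple bag-of-words cosine similarity stands in for whatever retrieval the authors use. It only shows the control flow: propose several futures, retrieve supporting clues for each, and keep the best-supported one.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a stand-in for a real text/visual embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, clues, k=2):
    """Return the k clues most similar to the query, best first."""
    q = bag_of_words(query)
    ranked = sorted(clues, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]


def propose_then_retrieve(proposals, clues, k=2):
    """Pick the proposal whose retrieved clues support it most strongly.

    Assumption: "support" is the summed similarity between a proposal and
    its top-k clues. The paper may integrate clues very differently.
    """
    scored = []
    for proposal in proposals:
        evidence = retrieve(proposal, clues, k)
        p = bag_of_words(proposal)
        support = sum(cosine(p, bag_of_words(e)) for e in evidence)
        scored.append((support, proposal, evidence))
    _, best, evidence = max(scored, key=lambda item: item[0])
    return best, evidence


# Hypothetical clues standing in for mined detail-level and event-level semantics.
clues = [
    "Anna hides the letter in the drawer",
    "Mark searches the kitchen for the letter",
    "Anna leaves the house in a hurry",
]
# Hypothetical future-event proposals (an LLM would generate these).
proposals = [
    "Mark finds the letter in the drawer",
    "Anna cooks dinner in the kitchen",
]

best, evidence = propose_then_retrieve(proposals, clues)
print(best)
print(evidence)
```

In this toy setup the first proposal wins because two clues mention the letter, while the second proposal overlaps with the clues mostly through common words. A realistic system would replace the token counts with learned embeddings and let a language model both write the proposals and judge the retrieved evidence.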
Problem

Research questions and friction points this paper is trying to address.

long-video event prediction
event semantics
multimodal context
narrative complexity
future event forecasting
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-video event prediction
multi-level semantics mining
character-centric visual prompt
knowledge-enhanced retrieval
propose-then-retrieve