🤖 AI Summary
This paper introduces the novel task of *long-horizon future narrative generation*, which aims to produce coherent, natural-language descriptions of daily activities occurring over the next several minutes, conditioned on egocentric video input—targeting applications in health monitoring, smart homes, and behavioral analysis. To address this, the authors propose ViNa, the first end-to-end vision–language model for this task, integrating long-sequence video encoding, cross-modal temporal alignment, and autoregressive narrative decoding. They further introduce *future video retrieval* as a new downstream application that enables interpretable, temporally grounded visualization of task plans. Evaluated on the Ego4D dataset, ViNa substantially outperforms short-horizon prediction baselines, achieving state-of-the-art performance. Generated narratives exhibit high temporal consistency and activity plausibility, marking the first successful semantic modeling of minute-scale future behavior in realistic, everyday settings.
📝 Abstract
Anticipating future events is crucial for application domains such as healthcare, smart home technology, and surveillance. Narrative event descriptions provide context-rich information, enhancing a system's future planning and decision-making capabilities. We propose a novel task, *long-term future narration generation*, which extends beyond traditional action anticipation by generating detailed narrations of future daily activities. We introduce ViNa, a vision–language model specifically designed to address this challenging task. ViNa integrates long-term videos and corresponding narrations to generate a sequence of future narrations that predict subsequent events and actions over extended time horizons. Unlike existing multimodal models, which make only short-term predictions or describe already-observed videos, ViNa generates long-term future narrations for a broader range of daily activities. We also present a novel downstream application, future video retrieval, which leverages the generated narrations to help users plan a task by visualizing the future. We evaluate future narration generation on Ego4D, the largest egocentric video dataset.
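The abstract does not specify ViNa's interface, but the task itself (observed egocentric narrations in, a sequence of future narrations out) can be made concrete with a minimal sketch. All names below (`NarrationSample`, `repeat_last_baseline`) are hypothetical illustrations of the task's input/output shape, not the paper's actual code; the trivial baseline stands in for a real model, which would additionally condition on video features.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical types for the long-term future narration task.
# These names are assumptions for illustration, not ViNa's API.

@dataclass
class NarrationSample:
    """One egocentric clip: what was observed vs. what comes next."""
    observed_narrations: List[str]  # narrations aligned to the observed video
    future_narrations: List[str]    # ground-truth future narrations to predict

def repeat_last_baseline(sample: NarrationSample, horizon: int) -> List[str]:
    """Trivial baseline: assume the last observed activity continues.

    A real model (like ViNa) would encode the video and decode future
    narrations autoregressively; this only illustrates the task format.
    """
    if not sample.observed_narrations:
        return ["unknown activity"] * horizon
    return [sample.observed_narrations[-1]] * horizon

# Example: predict the next 3 narration steps.
sample = NarrationSample(
    observed_narrations=["C opens the fridge", "C takes out the milk"],
    future_narrations=["C pours the milk", "C drinks", "C washes the glass"],
)
preds = repeat_last_baseline(sample, horizon=3)
```

Such a degenerate baseline is useful mainly as a floor: long-horizon generation is evaluated by how much a model's predicted narration sequence improves over simply extrapolating the most recent observed activity.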