Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events

📅 2026-03-06
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses key limitations in existing multimodal summarization methods: reliance on domain-specific supervision, insufficient cross-modal fusion, and a lack of event-level temporal modeling. To overcome these challenges, the authors propose the CoE framework, which introduces hierarchical event graph (HEG)-guided event-chain reasoning to enable training-free multimodal summarization for the first time. By leveraging an explicit event hierarchy, the approach facilitates cross-modal alignment and causal temporal reasoning, while integrating key visual cue localization and lightweight style adaptation. The method achieves strong interpretability and cross-domain generalization, significantly outperforming state-of-the-art approaches across eight benchmark datasets, with average gains of +3.04 in ROUGE, +9.51 in CIDEr, and +1.88 in BERTScore.

๐Ÿ“ Abstract
Multimodal Summarization (MMS) aims to generate concise textual summaries by understanding and integrating information across videos, transcripts, and images. However, existing approaches still suffer from three main challenges: (1) reliance on domain-specific supervision, (2) implicit fusion with weak cross-modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce **CoE**, a training-free MMS framework that performs structured reasoning through a **Chain-of-Events** guided by a Hierarchical Event Graph (HEG). The HEG encodes textual semantics into an explicit event hierarchy that scaffolds cross-modal grounding and temporal reasoning. Guided by this structure, **CoE** localizes key visual cues, models event evolution and causal transitions, and refines outputs via lightweight style adaptation for domain alignment. Extensive experiments on eight diverse datasets demonstrate that **CoE** consistently outperforms state-of-the-art video CoT baselines, achieving average gains of **+3.04 ROUGE**, **+9.51 CIDEr**, and **+1.88 BERTScore**, highlighting its robustness, interpretability, and cross-domain generalization. Our code is available at https://github.com/youxiaoxing/CoE.
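The abstract's core idea, a Hierarchical Event Graph whose temporally ordered traversal yields a Chain-of-Events for downstream reasoning, can be illustrated with a minimal sketch. All names (`Event`, `chain_of_events`, the toy video decomposition) are illustrative assumptions for exposition, not the paper's actual API:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a Hierarchical Event Graph (HEG) node: a coarse
# event with a time span in the video and finer-grained sub-events.
@dataclass
class Event:
    label: str                                     # short textual description
    span: tuple                                    # (start_sec, end_sec)
    children: list = field(default_factory=list)   # sub-events, if any

def chain_of_events(root: Event) -> list:
    """Flatten the event hierarchy into a temporally ordered chain
    (depth-first, children visited in start-time order); structured
    reasoning would then walk this chain event by event."""
    chain = []
    def visit(e: Event) -> None:
        chain.append(e.label)
        for child in sorted(e.children, key=lambda c: c.span[0]):
            visit(child)
    visit(root)
    return chain

# Toy example: a cooking video decomposed into two sub-events.
root = Event("make pasta", (0, 120), [
    Event("boil water", (0, 40)),
    Event("add sauce", (60, 120)),
])
print(chain_of_events(root))  # ['make pasta', 'boil water', 'add sauce']
```

In the paper's full pipeline, each chained event would additionally be grounded to key visual cues and the final summary refined by lightweight style adaptation; this sketch covers only the event-chain traversal itself.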
Problem

Research questions and friction points this paper is trying to address.

Multimodal Summarization
Cross-modal Grounding
Temporal Modeling
Event Transitions
Domain-specific Supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Events
Hierarchical Event Graph
Training-free Multimodal Summarization
Cross-modal Grounding
Temporal Reasoning