Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events

📅 2026-03-06
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses key limitations in existing multimodal summarization methods: reliance on domain-specific supervision, insufficient cross-modal fusion, and a lack of event-level temporal modeling. To overcome these challenges, the authors propose the CoE framework, which introduces hierarchical event graph (HEG)-guided event-chain reasoning to enable training-free multimodal summarization for the first time. By leveraging an explicit event hierarchy, the approach facilitates cross-modal alignment and causal temporal reasoning, while integrating key visual cue localization and lightweight style adaptation. The method achieves strong interpretability and cross-domain generalization, significantly outperforming state-of-the-art approaches across eight benchmark datasets, with average gains of +3.04 in ROUGE, +9.51 in CIDEr, and +1.88 in BERTScore.

๐Ÿ“ Abstract
Multimodal Summarization (MMS) aims to generate concise textual summaries by understanding and integrating information across videos, transcripts, and images. However, existing approaches still suffer from three main challenges: (1) reliance on domain-specific supervision, (2) implicit fusion with weak cross-modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce **CoE**, a training-free MMS framework that performs structured reasoning through a **Chain-of-Events** guided by a Hierarchical Event Graph (HEG). The HEG encodes textual semantics into an explicit event hierarchy that scaffolds cross-modal grounding and temporal reasoning. Guided by this structure, **CoE** localizes key visual cues, models event evolution and causal transitions, and refines outputs via lightweight style adaptation for domain alignment. Extensive experiments on eight diverse datasets demonstrate that **CoE** consistently outperforms state-of-the-art video CoT baselines, achieving average gains of **+3.04 ROUGE**, **+9.51 CIDEr**, and **+1.88 BERTScore**, highlighting its robustness, interpretability, and cross-domain generalization. Our code is available at https://github.com/youxiaoxing/CoE.
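The abstract's core idea, a Hierarchical Event Graph whose temporally ordered traversal yields a Chain-of-Events for downstream reasoning, can be illustrated with a minimal sketch. All names (`Event`, `chain_of_events`, the toy video decomposition) are illustrative assumptions for exposition, not the paper's actual API:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a Hierarchical Event Graph (HEG) node: a coarse
# event with a time span in the video and finer-grained sub-events.
@dataclass
class Event:
    label: str                                     # short textual description
    span: tuple                                    # (start_sec, end_sec)
    children: list = field(default_factory=list)   # sub-events, if any

def chain_of_events(root: Event) -> list:
    """Flatten the event hierarchy into a temporally ordered chain
    (depth-first, children visited in start-time order); structured
    reasoning would then walk this chain event by event."""
    chain = []
    def visit(e: Event) -> None:
        chain.append(e.label)
        for child in sorted(e.children, key=lambda c: c.span[0]):
            visit(child)
    visit(root)
    return chain

# Toy example: a cooking video decomposed into two sub-events.
root = Event("make pasta", (0, 120), [
    Event("boil water", (0, 40)),
    Event("add sauce", (60, 120)),
])
print(chain_of_events(root))  # ['make pasta', 'boil water', 'add sauce']
```

In the paper's full pipeline, each chained event would additionally be grounded to key visual cues and the final summary refined by lightweight style adaptation; this sketch covers only the event-chain traversal itself.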
Problem

Research questions and friction points this paper is trying to address.

Multimodal Summarization
Cross-modal Grounding
Temporal Modeling
Event Transitions
Domain-specific Supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Events
Hierarchical Event Graph
Training-free Multimodal Summarization
Cross-modal Grounding
Temporal Reasoning