Towards Visually Grounded Multimodal Summarization via Cross-Modal Transformer and Gated Attention

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing multimodal summarization methods often inject shallow visual features into deep language models, which leads to weak cross-modal alignment and underuse of visual information. To address this, the work proposes a unified framework that jointly generates a textual summary and selects representative images. The approach uses a Deep Visual Processor (DVP) for hierarchical, layer-wise cross-modal alignment and a gated attention mechanism for semantic fusion. A lightweight visual relevance predictor, trained by knowledge distillation from a Determinantal Point Process (DPP)-based teacher, selects diverse and salient images. Trained with a multi-objective joint loss, the method is reported to improve semantic accuracy, visual grounding, and image representativeness on multimodal summarization.
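
To make the DPP-based image selection concrete, here is a minimal sketch of a DPP teacher, not the authors' implementation. It assumes a common kernel construction (per-image quality times pairwise cosine similarity) and greedy log-determinant maximization. The function names `build_dpp_kernel`, `greedy_dpp_select`, and `dpp_marginals` are hypothetical, and using marginal inclusion probabilities as soft labels for the student predictor is an assumption, since the summary does not say how the soft labels are derived.

```python
import numpy as np

def build_dpp_kernel(features, quality):
    """Assumed L-ensemble kernel: L_ij = q_i * cos_sim(f_i, f_j) * q_j."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    similarity = feats @ feats.T
    return quality[:, None] * similarity * quality[None, :]

def greedy_dpp_select(kernel, k):
    """Greedily add the item that maximizes log det of the selected submatrix."""
    selected = []
    candidates = list(range(kernel.shape[0]))
    for _ in range(min(k, len(candidates))):
        best_item, best_logdet = None, -np.inf
        for i in candidates:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_item, best_logdet = i, logdet
        if best_item is None:
            # Every remaining item makes the submatrix singular; stop early.
            break
        selected.append(best_item)
        candidates.remove(best_item)
    return selected

def dpp_marginals(kernel):
    """Marginal inclusion probabilities: diag of K = L (L + I)^{-1}."""
    n = kernel.shape[0]
    return np.diag(kernel @ np.linalg.inv(kernel + np.eye(n)))
```

With two near-duplicate high-quality images and one distinct lower-quality image, the greedy selection picks one of the duplicates and then the distinct image, which is the diversity behaviour a DPP is meant to encourage.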

📝 Abstract

Multimodal summarization requires models to jointly understand textual and visual inputs to generate concise, semantically coherent summaries. Existing methods often inject shallow visual features into deep language models, leading to representational mismatches and weak cross-modal grounding. We propose a unified framework that jointly performs text summarization and representative image selection. Our system, SPeCTrA-Sum (Sampler Perceiver with Cross-modal Transformer and gated Attention for Summarization), introduces two key innovations. First, a Deep Visual Processor (DVP) aligns the visual encoder with the language model at corresponding depths, enabling hierarchical, layer-wise fusion that preserves semantic consistency. Second, a lightweight Visual Relevance Predictor (VRP) selects salient and diverse images by distilling soft labels from a Determinantal Point Process (DPP) teacher. SPeCTrA-Sum is trained using a multi-objective loss that combines autoregressive summarization, cross-modal alignment, and DPP-based distillation. Experiments show that our system produces more accurate, visually grounded summaries and selects more representative images, demonstrating the benefits of depth-aware fusion and principled image selection for multimodal summarization.
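
The gated fusion idea can be illustrated with a small sketch. This is not the SPeCTrA-Sum architecture: the abstract does not give the exact form of the gate, so the code assumes single-head cross-attention from text states to visual features, followed by a sigmoid gate computed from the concatenated text and attended visual vectors. The names `gated_cross_attention`, `init_params`, and the parameter keys are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(d, rng, gate_bias=0.0):
    """Small random projections; gate_bias shifts how open the gate starts."""
    scale = 0.1
    return {
        "Wq": rng.normal(0.0, scale, (d, d)),
        "Wk": rng.normal(0.0, scale, (d, d)),
        "Wv": rng.normal(0.0, scale, (d, d)),
        "Wg": rng.normal(0.0, scale, (2 * d, d)),
        "bg": np.full(d, gate_bias),
    }

def gated_cross_attention(text, visual, params):
    """text: (T, d) token states; visual: (M, d) image features at one depth."""
    d = text.shape[-1]
    q = text @ params["Wq"]
    k = visual @ params["Wk"]
    v = visual @ params["Wv"]
    attn = softmax(q @ k.T / np.sqrt(d))   # (T, M) weights over image features
    attended = attn @ v                    # (T, d) visual context per token
    gate_in = np.concatenate([text, attended], axis=-1)
    gate = sigmoid(gate_in @ params["Wg"] + params["bg"])  # per-dimension gate in (0, 1)
    # Residual update: the gate controls how much visual context enters each token.
    return text + gate * attended
```

In a depth-aware setup like the one the abstract describes, a block of this kind would presumably be applied at each language-model layer with visual features taken from the corresponding encoder depth. A strongly negative `gate_bias` closes the gate, so the output falls back to the text states unchanged.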

Problem

Research questions and friction points this paper is trying to address.

multimodal summarization

visual grounding

cross-modal alignment

representational mismatch

image selection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal Transformer

Gated Attention

Deep Visual Processor