Knowledge Completes the Vision: A Multimodal Entity-aware Retrieval-Augmented Generation Framework for News Image Captioning

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
News image captioning faces three key challenges: incomplete information coverage, weak cross-modal alignment, and poor visual-entity association. To address these, the authors propose MERGE, the first multimodal entity-aware retrieval-augmented generation framework for the task. MERGE constructs an Entity-Centric Multimodal Knowledge Base (EMKB) and combines a dynamic, image-guided retrieval mechanism with a multi-stage hypothesis-caption alignment strategy, enabling deep fusion of visual content, news text, and structured entity knowledge. This design improves cross-modal alignment accuracy and visual-entity grounding. On GoodNews and NYTimes800k, MERGE achieves CIDEr gains of +6.84 and +1.16 and F1-score improvements of +4.14 and +2.64, respectively. Notably, under zero-shot transfer to Visual News, it still attains +20.17 CIDEr and +6.22 F1-score gains, demonstrating strong generalizability and practical utility.

📝 Abstract
News image captioning aims to produce journalistically informative descriptions by combining visual content with contextual cues from associated articles. Despite recent advances, existing methods struggle with three key challenges: (1) incomplete information coverage, (2) weak cross-modal alignment, and (3) suboptimal visual-entity grounding. To address these issues, we introduce MERGE, the first Multimodal Entity-aware Retrieval-augmented GEneration framework for news image captioning. MERGE constructs an entity-centric multimodal knowledge base (EMKB) that integrates textual, visual, and structured knowledge, enabling enriched background retrieval. It improves cross-modal alignment through a multistage hypothesis-caption strategy and enhances visual-entity matching via dynamic retrieval guided by image content. Extensive experiments on GoodNews and NYTimes800k show that MERGE significantly outperforms state-of-the-art baselines, with CIDEr gains of +6.84 and +1.16 in caption quality, and F1-score improvements of +4.14 and +2.64 in named entity recognition. Notably, MERGE also generalizes well to the unseen Visual News dataset, achieving +20.17 in CIDEr and +6.22 in F1-score, demonstrating strong robustness and domain adaptability.
Problem

Research questions and friction points this paper is trying to address.

Addresses incomplete information coverage in news image captioning
Improves weak cross-modal alignment between images and text
Strengthens visual-entity grounding in generated captions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Entity-centric multimodal knowledge base for enriched retrieval
Multistage hypothesis-caption strategy for cross-modal alignment
Dynamic retrieval guided by image content for entity matching
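The three innovations above can be caricatured in a few lines of Python: look up background facts in a small entity-centric knowledge base keyed on entities detected in the image, then fold the retrieved facts and a draft (hypothesis) caption into the generation prompt for refinement. This is an illustrative sketch only; the EMKB structure, the overlap-based retrieval scoring, and the prompt format are assumptions for exposition, not the paper's actual implementation.

```python
def retrieve(detected_entities, emkb, top_k=2):
    """Rank EMKB entries by name/alias overlap with entities detected
    in the image or article (a toy stand-in for MERGE's dynamic,
    image-guided retrieval), keeping only entries that match."""
    scored = sorted(
        emkb.items(),
        key=lambda kv: len(set(detected_entities) & ({kv[0]} | set(kv[1]["aliases"]))),
        reverse=True,
    )
    return [
        dict(entity=name, **entry)
        for name, entry in scored[:top_k]
        if set(detected_entities) & ({name} | set(entry["aliases"]))
    ]


def build_prompt(article_snippet, hypothesis_caption, facts):
    """Assemble a refinement prompt that fuses the article context, a
    draft (hypothesis) caption, and retrieved entity background."""
    background = "; ".join(f"{f['entity']}: {f['description']}" for f in facts)
    return (
        f"Article: {article_snippet}\n"
        f"Draft caption: {hypothesis_caption}\n"
        f"Entity background: {background}\n"
        f"Refined caption:"
    )


if __name__ == "__main__":
    # Hypothetical two-entry knowledge base for demonstration.
    emkb = {
        "Angela Merkel": {"aliases": ["Merkel"],
                          "description": "former Chancellor of Germany"},
        "Berlin": {"aliases": [], "description": "capital of Germany"},
    }
    facts = retrieve(["Merkel"], emkb)
    print(build_prompt("Merkel spoke in Berlin.", "A woman speaks.", facts))
```

In the full framework the detected entities would come from a visual recognizer, the hypothesis caption from a first-pass generator, and the refined caption from a multimodal language model; the sketch only shows how the retrieved knowledge is routed into generation.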
Xiaoxing You
Hangzhou Dianzi University, Hangzhou, China
Qiang Huang
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Lingyu Li
Shanghai Jiao Tong University
Chi Zhang
People’s Daily, Beijing, China
Xiaopeng Liu
People’s Daily, Beijing, China
Min Zhang
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Jun Yu
Harbin Institute of Technology (Shenzhen), Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China