🤖 AI Summary
Image captioning still falls short in accuracy and descriptiveness. To address this, we propose the Attention-Guided Image Captioning (AGIC) framework, which amplifies attention to salient regions directly in the visual feature space to improve caption relevance and fine-grained detail. We further introduce a hybrid decoding strategy that combines deterministic and probabilistic sampling to balance linguistic fluency and descriptive diversity. Evaluated on Flickr8k and Flickr30k, AGIC matches or surpasses several state-of-the-art models on BLEU, METEOR, and CIDEr while achieving notably faster inference. Our key contributions are: (1) an interpretable, feature-space attention guidance mechanism that improves visual grounding; and (2) a hybrid decoding paradigm that balances caption quality and computational efficiency.
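The summary does not specify how the hybrid decoding strategy combines its two modes, so the following is only a minimal illustrative sketch: decode greedily (deterministic) when the model is confident, and fall back to top-k sampling (probabilistic) otherwise. The `confidence_threshold` and `top_k` parameters are assumptions, not values from the paper.

```python
import numpy as np

def hybrid_decode_step(logits, confidence_threshold=0.5, top_k=5, rng=None):
    """One step of a hypothetical hybrid decoding rule (not AGIC's exact
    scheme): argmax when the model is confident, top-k sampling otherwise."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Softmax over the vocabulary logits for this step.
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    if probs.max() >= confidence_threshold:
        # Deterministic branch: favors fluency.
        return int(np.argmax(probs))
    # Probabilistic branch: sample among the top-k tokens for diversity.
    top = np.argsort(probs)[-top_k:]
    p = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=p))
```

In practice such a rule would run once per generated token inside the caption decoder's loop; the point of the sketch is only that the deterministic and sampling branches trade off fluency against diversity.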
📝 Abstract
Despite significant progress in image captioning, generating accurate and descriptive captions remains a long-standing challenge. In this study, we propose Attention-Guided Image Captioning (AGIC), which amplifies salient visual regions directly in the feature space to guide caption generation. We further introduce a hybrid decoding strategy that combines deterministic and probabilistic sampling to balance fluency and diversity. To evaluate AGIC, we conduct extensive experiments on the Flickr8k and Flickr30k datasets. The results show that AGIC matches or surpasses several state-of-the-art models while achieving faster inference. Moreover, AGIC demonstrates strong performance across multiple evaluation metrics, offering a scalable and interpretable solution for image captioning.
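The abstract describes amplifying salient visual regions "directly in the feature space" without giving the exact operation; a minimal sketch of one plausible reading, assuming per-region saliency scores are available, is to rescale each region's feature vector by its relative saliency. The `gamma` amplification factor and the normalization scheme are assumptions for illustration only.

```python
import numpy as np

def amplify_salient_features(features, saliency, gamma=1.5):
    """Hypothetical feature-space attention guidance (illustrative, not the
    paper's exact mechanism): boost above-average-saliency regions and
    attenuate the rest before the features reach the caption decoder.

    features: (R, D) array of R region feature vectors
    saliency: (R,) array of nonnegative saliency scores
    """
    w = saliency / saliency.sum()          # normalize to a distribution
    scale = 1.0 + gamma * (w - w.mean())   # amplify above-average regions
    return features * scale[:, None]       # broadcast over feature dims
```

A region with high saliency ends up with a proportionally larger feature norm, so downstream attention over regions is nudged toward it, which is one way to realize the "attention guidance" the abstract describes.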