🤖 AI Summary
Existing retrieval-augmented image captioning methods suffer from two key limitations: (1) coarse-grained semantic prompting and (2) inadequate modeling of visual relationships. To address these, we propose RACap, a lightweight, relation-aware retrieval-augmented framework. First, a relation-aware prompting mechanism explicitly extracts structured semantic relations from retrieved textual evidence. Second, a heterogeneous object recognition module jointly models diverse object types and their interactions within the image. Third, structured feature retrieval, coupled with a lightweight fusion network, achieves fine-grained relational alignment between vision and language. The resulting model contains only 10.8 million trainable parameters and consistently outperforms comparable lightweight baselines across multiple benchmarks, improving both the semantic consistency and the relational expressiveness of generated captions and demonstrating a strong ability to ground fine-grained visual relationships through retrieval-guided generation.
📝 Abstract
Recent retrieval-augmented image captioning methods incorporate external knowledge to compensate for limitations in comprehending complex scenes. However, current approaches face two challenges in relation modeling: (1) the representation of semantic prompts is too coarse-grained to capture fine-grained relationships; and (2) they lack explicit modeling of image objects and their semantic relationships. To address these limitations, we propose RACap, a relation-aware retrieval-augmented model for image captioning, which not only mines structured relation semantics from retrieved captions but also identifies heterogeneous objects in the image. RACap retrieves structured relation features that carry heterogeneous visual information, enhancing both semantic consistency and relational expressiveness. Experimental results show that RACap, with only 10.8M trainable parameters, outperforms previous lightweight captioning models.
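The retrieve-then-fuse pipeline the abstract describes can be illustrated with a minimal sketch: score a memory of relation features (e.g., embeddings of relation triples mined from retrieval captions) against an image feature, keep the top-k, and fuse them with the image feature through a small projection. All names, dimensions, and the single-linear-layer fusion here are illustrative assumptions, not RACap's actual architecture, which the abstract does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # shared embedding dimension (illustrative choice)

# Hypothetical memory of structured relation features, e.g. embeddings of
# (subject, predicate, object) triples mined from retrieved captions.
relation_memory = rng.normal(size=(100, d))

def l2norm(x, axis=-1):
    # Normalize so that dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def retrieve_relations(image_feat, memory, k=5):
    """Return the k relation features most similar to the image feature."""
    sims = l2norm(memory) @ l2norm(image_feat)  # cosine similarity per entry
    top = np.argsort(-sims)[:k]
    return memory[top]

# Lightweight fusion: concatenate pooled relation evidence with the image
# feature and project back to d dims. A single random linear map stands in
# for the paper's fusion network.
W = rng.normal(size=(2 * d, d)) * 0.02

def fuse(image_feat, retrieved):
    pooled = retrieved.mean(axis=0)          # aggregate retrieved relations
    return np.concatenate([image_feat, pooled]) @ W

image_feat = rng.normal(size=d)
prompt_feat = fuse(image_feat, retrieve_relations(image_feat, relation_memory))
print(prompt_feat.shape)  # (64,) — a relation-aware prompt for the decoder
```

In a real system the fused vector would condition a caption decoder; the sketch only shows how retrieval and fusion compose.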