🤖 AI Summary
Existing retrieval-augmented image captioning methods suffer from two key limitations: (1) coarse-grained semantic prompting and (2) inadequate modeling of visual relationships. To address these, we propose RACap, a lightweight, relation-aware retrieval-augmented framework. First, a relation-aware prompting mechanism explicitly extracts structured semantic relations from retrieved textual evidence. Second, a heterogeneous object recognition module jointly models diverse object types and their interactions within the image. Third, structured feature retrieval, coupled with a lightweight fusion network, achieves fine-grained relational alignment between vision and language. The resulting model contains only 10.8 million trainable parameters and consistently outperforms comparable lightweight baselines across multiple benchmarks, improving both the semantic consistency and the relational expressiveness of generated captions and demonstrating a strong ability to ground fine-grained visual relationships through retrieval-guided generation.
📝 Abstract
Recent retrieval-augmented image captioning methods incorporate external knowledge to compensate for limitations in comprehending complex scenes. However, current approaches face two challenges in relation modeling: (1) the representation of semantic prompts is too coarse-grained to capture fine-grained relationships; and (2) they lack explicit modeling of image objects and their semantic relationships. To address these limitations, we propose RACap, a relation-aware retrieval-augmented model for image captioning, which not only mines structured relation semantics from retrieved captions but also identifies heterogeneous objects in the image. RACap retrieves structured relation features that carry heterogeneous visual information, enhancing both semantic consistency and relational expressiveness. Experimental results show that RACap, with only 10.8M trainable parameters, outperforms previous lightweight captioning models.
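The retrieve-then-fuse pipeline the abstract describes can be illustrated with a minimal sketch: score a memory of relation features (e.g., embeddings of relation triples mined from retrieval captions) against an image feature, keep the top-k, and fuse them with the image feature through a small projection. All names, dimensions, and the single-linear-layer fusion here are illustrative assumptions, not RACap's actual architecture, which the abstract does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # shared embedding dimension (illustrative choice)

# Hypothetical memory of structured relation features, e.g. embeddings of
# (subject, predicate, object) triples mined from retrieved captions.
relation_memory = rng.normal(size=(100, d))

def l2norm(x, axis=-1):
    # Normalize so that dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def retrieve_relations(image_feat, memory, k=5):
    """Return the k relation features most similar to the image feature."""
    sims = l2norm(memory) @ l2norm(image_feat)  # cosine similarity per entry
    top = np.argsort(-sims)[:k]
    return memory[top]

# Lightweight fusion: concatenate pooled relation evidence with the image
# feature and project back to d dims. A single random linear map stands in
# for the paper's fusion network.
W = rng.normal(size=(2 * d, d)) * 0.02

def fuse(image_feat, retrieved):
    pooled = retrieved.mean(axis=0)          # aggregate retrieved relations
    return np.concatenate([image_feat, pooled]) @ W

image_feat = rng.normal(size=d)
prompt_feat = fuse(image_feat, retrieve_relations(image_feat, relation_memory))
print(prompt_feat.shape)  # (64,) — a relation-aware prompt for the decoder
```

In a real system the fused vector would condition a caption decoder; the sketch only shows how retrieval and fusion compose.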