Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key bottleneck in unsupervised image captioning, namely its reliance on manually aligned image-text pairs, by proposing TOMCap, a text-only training method that requires no paired image-caption data and leverages only raw text corpora. Methodologically, TOMCap jointly exploits CLIP's cross-modal representations and retrieval-augmented prompting: it retrieves semantically relevant textual exemplars and aligns them with image latent vectors in CLIP's shared embedding space, while a modality-gap correction mechanism guides a pretrained language model decoder toward high-quality captions. This approach bridges the vision-language modality gap without requiring any annotated image supervision. Extensive experiments on Flickr30K and COCO show that TOMCap significantly outperforms existing training-free and text-only baselines, validating the efficacy and generalizability of retrieval augmentation coupled with latent-space alignment for unsupervised multimodal generation.
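The gap-correction-then-retrieve idea can be sketched as follows. This is a minimal NumPy illustration with synthetic embeddings, not TOMCap's actual implementation: the constant-offset gap model and the mean-difference correction are assumptions made for clarity (one common correction scheme in the modality-gap literature), and real CLIP encoders would replace the random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Synthetic stand-ins for CLIP's shared embedding space (illustrative only).
text_bank = rng.normal(size=(200, dim))            # embeddings of corpus captions
text_bank /= np.linalg.norm(text_bank, axis=1, keepdims=True)

# Simulate the modality gap: image embeddings live in the same space but are
# shifted from their matching captions by a roughly constant offset.
# (Normalization of the image side is omitted to keep the sketch exact.)
true_gap = rng.normal(size=dim) * 0.8
image_bank = text_bank + true_gap

# Assumed correction scheme: estimate the gap as the difference of the
# per-modality mean embeddings, then subtract it from the image side.
est_gap = image_bank.mean(axis=0) - text_bank.mean(axis=0)

def retrieve_captions(img_emb, k=3):
    """Return indices of the k most similar corpus captions after gap correction."""
    corrected = img_emb - est_gap
    corrected /= np.linalg.norm(corrected)
    sims = text_bank @ corrected                   # cosine similarity (unit vectors)
    return np.argsort(-sims)[:k]

query = image_bank[7]                              # pretend this is a new image
print(retrieve_captions(query))                    # caption 7 ranks first
```

Without the `est_gap` subtraction, retrieval would compare vectors across the image/text offset directly, which is exactly the failure mode the modality-gap correction is meant to avoid.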

📝 Abstract
Image captioning has drawn considerable attention from the natural language processing and computer vision fields. Aiming to reduce the reliance on curated data, several studies have explored image captioning without any human-annotated image-text pairs for training, although existing methods are still outperformed by fully supervised approaches. This paper proposes TOMCap, an improved text-only training method that performs captioning without the need for aligned image-caption pairs. The method prompts a pre-trained language model decoder with information derived from a CLIP representation that has first been processed to reduce the modality gap. We specifically tested the combined use of retrieved caption examples and latent vector representations to guide the generation process. Through extensive experiments, we show that TOMCap outperforms other training-free and text-only methods. We also analyze the impact of different configuration choices for the retrieval-augmentation and modality-gap reduction components.
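The "prompting a pre-trained language model decoder with retrieved caption examples" step amounts to assembling a few-shot style prompt. A minimal sketch is below; the template wording and `build_prompt` helper are hypothetical, since the paper's exact prompt format is not given here.

```python
def build_prompt(retrieved_captions):
    """Assemble a retrieval-augmented prompt for a language model decoder.

    The template below is a hypothetical illustration of the idea, not
    TOMCap's actual prompt format.
    """
    examples = "\n".join(f"- {c}" for c in retrieved_captions)
    return (
        "Descriptions of similar images:\n"
        f"{examples}\n"
        "Describe the image in one sentence:"
    )

prompt = build_prompt([
    "A dog runs across a grassy field.",
    "Two dogs play fetch in a park.",
])
print(prompt)
```

The retrieved examples steer the decoder toward in-domain caption style and vocabulary, while the gap-corrected CLIP latent (not shown here) supplies the image-specific content signal.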
Problem

Research questions and friction points this paper is trying to address.

Reduces reliance on human-annotated image-caption pairs
Corrects modality gap between text and image representations
Enhances captioning using retrieval-augmented text-only training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-only training with retrieval-augmented caption examples
Modality gap reduction for CLIP-guided language model prompting
Training-free image captioning without aligned image-text pairs