Continual Learning for Image Captioning through Improved Image-Text Alignment

📅 2025-10-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address catastrophic forgetting and degraded vision–language semantic alignment in continual image captioning, this paper proposes a semantic-guided multi-loss continual learning framework. Methodologically, built upon the ViT-GPT-2 architecture, it jointly optimizes cross-entropy loss, prompt-driven cosine similarity loss, CLIP-style cross-modal alignment loss, and language-guided triplet contrastive loss; synthetic semantic prompts are introduced to enhance inter-class separability, enabling dynamic alignment without inference-time prompting. The key innovation lies in the first integration of prompt learning with multi-granularity contrastive alignment for continual caption generation, effectively balancing knowledge retention and semantic consistency. Experiments demonstrate substantial improvements over state-of-the-art methods across multiple continual learning benchmarks, with marked gains in caption quality and alignment accuracy, while incurring zero inference overhead.

📝 Abstract
Generating accurate and coherent image captions in a continual learning setting remains a major challenge due to catastrophic forgetting and the difficulty of aligning evolving visual concepts with language over time. In this work, we propose a novel multi-loss framework for continual image captioning that integrates semantic guidance through prompt-based continual learning and contrastive alignment. Built upon a pretrained ViT-GPT-2 backbone, our approach combines standard cross-entropy loss with three additional components: (1) a prompt-based cosine similarity loss that aligns image embeddings with synthetically constructed prompts encoding objects, attributes, and actions; (2) a CLIP-style loss that promotes alignment between image embeddings and target caption embeddings; and (3) a language-guided contrastive loss that employs a triplet loss to enhance class-level discriminability between tasks. Notably, our approach introduces no additional overhead at inference time and requires no prompts during caption generation. We find that this approach mitigates catastrophic forgetting while achieving better semantic caption alignment compared to state-of-the-art methods. The code can be found at https://github.com/Gepardius/Taetz_Bordelius_Continual_ImageCaptioning.
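The three auxiliary losses described above can be sketched in plain Python to make their structure concrete. This is a minimal illustration, not the paper's implementation: the embeddings are toy vectors, the loss weights (`l1`, `l2`, `l3`), the temperature, and the triplet margin are assumed placeholder values, and a real system would operate on ViT/GPT-2 embedding tensors with a framework such as PyTorch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def prompt_cosine_loss(img_emb, prompt_emb):
    # (1) Pull the image embedding toward its synthetic semantic prompt
    # (objects, attributes, actions); zero when perfectly aligned.
    return 1.0 - cosine(img_emb, prompt_emb)

def clip_style_loss(img_embs, cap_embs, temperature=0.07):
    # (2) Symmetric InfoNCE over a batch of (image, caption) pairs,
    # as in CLIP: each image should match its own caption and vice versa.
    n = len(img_embs)
    logits = [[cosine(i, c) / temperature for c in cap_embs] for i in img_embs]

    def nll(row, target):
        m = max(row)  # stabilized log-softmax
        z = sum(math.exp(x - m) for x in row)
        return -(row[target] - m - math.log(z))

    i2t = sum(nll(logits[k], k) for k in range(n)) / n
    t2i = sum(nll([logits[r][k] for r in range(n)], k) for k in range(n)) / n
    return 0.5 * (i2t + t2i)

def triplet_loss(anchor, positive, negative, margin=0.2):
    # (3) Language-guided triplet term: keep the anchor closer to the
    # positive (same-class caption) than to the negative by a margin.
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

def total_loss(ce, img_emb, prompt_emb, img_batch, cap_batch,
               anchor, positive, negative, l1=0.5, l2=0.5, l3=0.5):
    # Weighted sum of cross-entropy plus the three alignment terms;
    # the weights here are illustrative, not the paper's values.
    return (ce
            + l1 * prompt_cosine_loss(img_emb, prompt_emb)
            + l2 * clip_style_loss(img_batch, cap_batch)
            + l3 * triplet_loss(anchor, positive, negative))
```

Because all three auxiliary terms act only on embeddings during training, dropping them at test time leaves the captioner unchanged, which is why the method adds no inference-time overhead.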
Problem

Research questions and friction points this paper is trying to address.

Addresses catastrophic forgetting in continual image captioning
Improves alignment between evolving visual concepts and language
Enhances semantic caption quality without inference overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-loss framework integrates prompt-based continual learning
Combines cross-entropy loss with three alignment components
Uses contrastive alignment without inference overhead