Caption, Create, Continue: Continual Learning with Pre-trained Generative Vision-Language Models

📅 2024-09-26
🏛️ Proceedings of the 34th ACM International Conference on Information and Knowledge Management
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address severe catastrophic forgetting and the reliance on raw-data storage in class-incremental continual learning, this paper proposes a replay-free, memory-efficient generative continual learning paradigm. Instead of storing real samples, it leverages pre-trained vision-language models (BLIP and Stable Diffusion) for text-guided, task-adaptive image regeneration. A learnable Task Router and dedicated per-task Task Heads support dynamic task routing and modular knowledge isolation. This work is the first to deeply integrate multimodal text-image co-generation into a continual learning framework. Evaluated on three standard benchmarks, the method improves average task accuracy by up to 54% and reduces memory footprint by up to 63× relative to four recent continual learning baselines, significantly enhancing knowledge retention and cross-task generalization.

📝 Abstract
Continual learning (CL) enables models to adapt to evolving data streams without catastrophic forgetting, a fundamental requirement for real-world AI systems. However, current methods often depend on large replay buffers or heavily annotated datasets, which are impractical due to storage, privacy, and cost constraints. We propose CLTS (Continual Learning via Text-Image Synergy), a novel class-incremental framework that mitigates forgetting without storing real task data. CLTS leverages pre-trained vision-language models: BLIP (Bootstrapping Language-Image Pre-training) for caption generation and Stable Diffusion for sample generation. Each task is handled by a dedicated Task Head, while a Task Router learns to assign inputs to the correct Task Head using the generated data. On three benchmark datasets, CLTS improves average task accuracy by up to 54% and achieves 63 times better memory efficiency compared to four recent continual learning baselines, demonstrating improved retention and adaptability. CLTS introduces a novel perspective by integrating generative text-image augmentation for scalable continual learning.
Problem

Research questions and friction points this paper is trying to address.

Mitigates catastrophic forgetting in continual learning without storing real task data
Overcomes impractical reliance on large replay buffers and heavily annotated datasets
Enables models to adapt to evolving data streams under storage and privacy constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages pre-trained vision-language models for caption generation
Uses Stable Diffusion to generate synthetic training samples
Trains a Task Router on the generated data to assign inputs to the correct Task Head
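The routing mechanism in the bullets above can be sketched in miniature. This is a hypothetical toy, not the paper's implementation: it substitutes nearest-centroid routing for the learned Task Router, random feature vectors for BLIP-captioned, diffusion-generated images, and per-class centroids for trained Task Heads. All class and function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

class TaskHead:
    """Per-task classifier (toy stand-in): nearest class centroid within one task."""
    def __init__(self, features, labels):
        self.classes = sorted(set(labels))
        self.centroids = np.stack(
            [features[np.array(labels) == c].mean(axis=0) for c in self.classes]
        )

    def predict(self, x):
        return self.classes[np.argmin(np.linalg.norm(self.centroids - x, axis=1))]

class TaskRouter:
    """Assigns an input to a Task Head using per-task (synthetic) data."""
    def __init__(self):
        self.task_centroids = {}

    def add_task(self, task_id, generated_features):
        # In CLTS the router learns from regenerated samples, so no real
        # task data needs to be stored; here a mean vector stands in.
        self.task_centroids[task_id] = generated_features.mean(axis=0)

    def route(self, x):
        return min(self.task_centroids,
                   key=lambda t: np.linalg.norm(self.task_centroids[t] - x))

# Two toy "tasks" with well-separated feature clusters (classes 0-1 and 2-3).
heads, router = {}, TaskRouter()
for task_id, offset in [(0, 0.0), (1, 10.0)]:
    feats = rng.normal(offset, 0.5, size=(40, 8))
    labels = [task_id * 2 + (i % 2) for i in range(40)]
    heads[task_id] = TaskHead(feats, labels)
    router.add_task(task_id, feats)  # stands in for generated replay data

x = rng.normal(10.0, 0.5, size=8)  # query resembling task 1
task = router.route(x)             # dynamic task routing
pred = heads[task].predict(x)      # modular, per-task prediction
print(task, pred)
```

The point of the sketch is the modular isolation: each Task Head only ever sees its own task's classes, and the router's job is purely task-level, which is what lets knowledge for old tasks stay untouched as new tasks arrive.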
Indu Solomon
International Institute of Information Technology Bangalore (IIITB), India
Aye Phyu Phyu Aung
Institute for Infocomm Research (I2R)
Generative Models, Reinforcement Learning
Uttam Kumar
International Institute of Information Technology Bangalore (IIITB), India
Senthilnath Jayavelu
Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore