Caption, Create, Continue: Continual Learning with Pre-trained Generative Vision-Language Models

📅 2024-09-26
🏛️ Proceedings of the 34th ACM International Conference on Information and Knowledge Management
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address severe catastrophic forgetting and the reliance on raw-data storage in class-incremental continual learning, this paper proposes a replay-free, memory-efficient generative continual learning paradigm. Instead of storing real samples, it leverages pre-trained vision-language models (BLIP and Stable Diffusion) for text-guided, task-adaptive image regeneration. A learnable Task Router and dedicated per-task Task Heads support dynamic task routing and modular knowledge isolation. This work is the first to deeply integrate multimodal text-image co-generation into a continual learning framework. Evaluated on three standard benchmarks, the method improves average task accuracy by up to 54% and reduces memory footprint by up to 63× relative to four recent continual learning baselines, significantly enhancing knowledge retention and cross-task generalization.

📝 Abstract
Continual learning (CL) enables models to adapt to evolving data streams without catastrophic forgetting, a fundamental requirement for real-world AI systems. However, current methods often depend on large replay buffers or heavily annotated datasets, which are impractical due to storage, privacy, and cost constraints. We propose CLTS (Continual Learning via Text-Image Synergy), a novel class-incremental framework that mitigates forgetting without storing real task data. CLTS leverages pre-trained vision-language models: BLIP (Bootstrapping Language-Image Pre-training) for caption generation and Stable Diffusion for sample generation. Each task is handled by a dedicated Task Head, while a Task Router learns to assign inputs to the correct Task Head using the generated data. On three benchmark datasets, CLTS improves average task accuracy by up to 54% and achieves 63 times better memory efficiency compared to four recent continual learning baselines, demonstrating improved retention and adaptability. CLTS introduces a novel perspective by integrating generative text-image augmentation for scalable continual learning.
Problem

Research questions and friction points this paper is trying to address.

Mitigates catastrophic forgetting in continual learning without storing real task data
Overcomes impractical reliance on large replay buffers and heavily annotated datasets
Enables models to adapt to evolving data streams under storage and privacy constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages pre-trained vision-language models for caption generation
Uses Stable Diffusion to generate synthetic training samples
Trains a Task Router on the generated data to assign inputs to the correct Task Head
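The routing mechanism in the bullets above can be sketched in miniature. This is a hypothetical toy, not the paper's implementation: it substitutes nearest-centroid routing for the learned Task Router, random feature vectors for BLIP-captioned, diffusion-generated images, and per-class centroids for trained Task Heads. All class and function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

class TaskHead:
    """Per-task classifier (toy stand-in): nearest class centroid within one task."""
    def __init__(self, features, labels):
        self.classes = sorted(set(labels))
        self.centroids = np.stack(
            [features[np.array(labels) == c].mean(axis=0) for c in self.classes]
        )

    def predict(self, x):
        return self.classes[np.argmin(np.linalg.norm(self.centroids - x, axis=1))]

class TaskRouter:
    """Assigns an input to a Task Head using per-task (synthetic) data."""
    def __init__(self):
        self.task_centroids = {}

    def add_task(self, task_id, generated_features):
        # In CLTS the router learns from regenerated samples, so no real
        # task data needs to be stored; here a mean vector stands in.
        self.task_centroids[task_id] = generated_features.mean(axis=0)

    def route(self, x):
        return min(self.task_centroids,
                   key=lambda t: np.linalg.norm(self.task_centroids[t] - x))

# Two toy "tasks" with well-separated feature clusters (classes 0-1 and 2-3).
heads, router = {}, TaskRouter()
for task_id, offset in [(0, 0.0), (1, 10.0)]:
    feats = rng.normal(offset, 0.5, size=(40, 8))
    labels = [task_id * 2 + (i % 2) for i in range(40)]
    heads[task_id] = TaskHead(feats, labels)
    router.add_task(task_id, feats)  # stands in for generated replay data

x = rng.normal(10.0, 0.5, size=8)  # query resembling task 1
task = router.route(x)             # dynamic task routing
pred = heads[task].predict(x)      # modular, per-task prediction
print(task, pred)
```

The point of the sketch is the modular isolation: each Task Head only ever sees its own task's classes, and the router's job is purely task-level, which is what lets knowledge for old tasks stay untouched as new tasks arrive.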
Indu Solomon
International Institute of Information Technology Bangalore (IIITB), India
Aye Phyu Phyu Aung
Institute for Infocomm Research (I2R)
Generative Models, Reinforcement Learning
Uttam Kumar
International Institute of Information Technology Bangalore (IIITB), India
Senthilnath Jayavelu
Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore