Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models

📅 2025-07-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current generative vision-language models suffer degraded instruction-following capability during continual learning, primarily because updates to the visual projector weaken alignment with linguistic instructions, especially under repeated textual instructions. This paper proposes an instruction-aligned visual-projector continual learning framework. The method addresses this via: (1) an instruction-context-aware mixture of visual projectors that dynamically maps visual features into the language space; and (2) a lightweight adaptive mechanism combining expert routing and pruning to mitigate cross-task knowledge interference. Evaluated on a multi-stage vision-language continual learning benchmark, the approach significantly improves instruction adherence and task generalization: generated outputs follow linguistic instructions more faithfully, achieving higher accuracy and robustness than existing state-of-the-art methods.

📝 Abstract
Continual learning enables pre-trained generative vision-language models (VLMs) to incorporate knowledge from new tasks without retraining on data from previous ones. Recent methods update a visual projector, which connects pre-trained vision encoders with large language models, to translate visual information for new tasks. However, such adjustments may cause the models to prioritize visual inputs over language instructions, particularly when learning tasks with repetitive types of textual instructions. To address this neglect of language instructions, we propose a novel framework that grounds the translation of visual information on instructions for language models. We introduce a mixture of visual projectors, each serving as a specialized visual-to-language translation expert based on the given instruction context, to adapt to new tasks. To avoid using experts for irrelevant instruction contexts, we propose an expert recommendation strategy that reuses experts for tasks similar to those previously learned. Additionally, we introduce expert pruning to alleviate interference from experts that were cumulatively activated in previous tasks. Extensive experiments on diverse vision-language tasks demonstrate that our method outperforms existing continual learning approaches by generating instruction-following responses.
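The core idea of the abstract can be sketched minimally: a router scores each projector expert against an embedding of the instruction, visual tokens are mapped by the weighted combination of experts, and pruned experts are masked out of the routing. All names, dimensions, and the linear-projector form below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VIS_DIM, LLM_DIM, NUM_EXPERTS, INSTR_DIM = 8, 16, 4, 8

# One linear visual-to-language projector per expert (illustrative stand-in
# for the paper's projector experts).
experts = [rng.standard_normal((VIS_DIM, LLM_DIM)) * 0.1 for _ in range(NUM_EXPERTS)]
# Router that scores experts from an instruction embedding (illustrative).
router = rng.standard_normal((INSTR_DIM, NUM_EXPERTS)) * 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def project(visual_tokens, instr_emb, active=None):
    """Map visual tokens into the language space, weighting projector
    experts by their relevance to the instruction context.
    `active` is an optional boolean mask modeling expert pruning."""
    scores = instr_emb @ router                       # (NUM_EXPERTS,)
    if active is not None:
        scores = np.where(active, scores, -np.inf)    # pruned experts get zero weight
    weights = softmax(scores)
    out = sum(w * (visual_tokens @ W) for w, W in zip(weights, experts))
    return out, weights

tokens = rng.standard_normal((3, VIS_DIM))            # 3 visual tokens
instr = rng.standard_normal(INSTR_DIM)                # instruction embedding
out, w = project(tokens, instr)                       # out: (3, 16); weights sum to ~1
```

The instruction embedding, not the visual features, drives the routing, which is what "grounding the translation on instructions" amounts to in this sketch.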
Problem

Research questions and friction points this paper is trying to address.

Addresses neglect of language instructions in continual learning
Proposes instruction-grounded visual projectors for VLMs
Mitigates interference from repetitive textual instruction tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Instruction-grounded visual projectors for VLMs
Mixture of specialized visual translation experts
Expert recommendation and pruning strategy
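The recommendation strategy described above, reusing experts for tasks similar to previously learned ones, can be sketched as a similarity lookup over stored task embeddings. The threshold, the cosine similarity measure, and the task-to-expert mapping are all illustrative assumptions, not details from the paper.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def recommend_experts(task_emb, past_tasks, threshold=0.8):
    """Reuse the experts of the most similar previously learned task if its
    similarity exceeds a threshold; otherwise signal that a fresh expert
    should be allocated. `past_tasks` is a list of (embedding, expert_ids)
    pairs (illustrative bookkeeping, not the paper's data structure)."""
    best_ids, best_sim = None, -1.0
    for emb, expert_ids in past_tasks:
        sim = cosine(task_emb, emb)
        if sim > best_sim:
            best_sim, best_ids = sim, expert_ids
    if best_sim >= threshold:
        return best_ids           # reuse experts from the similar task
    return None                   # caller should add a new expert

past = [(np.array([1.0, 0.0]), [0]), (np.array([0.0, 1.0]), [1])]
reused = recommend_experts(np.array([0.9, 0.1]), past)   # similar to task 0
fresh = recommend_experts(np.array([1.0, 1.0]), past)    # dissimilar to both
```

This keeps the expert pool from growing with every task: only genuinely novel instruction contexts add experts, which is also what makes the pruning step in the previous bullet tractable.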