🤖 AI Summary
Current generative vision-language models suffer from degraded instruction-following capability during continual learning, primarily because updates to the visual projector weaken alignment with linguistic instructions, especially when tasks share repetitive textual instructions. This paper proposes a continual learning framework that keeps the visual projector aligned with instructions. The method has two parts: (1) an instruction-context-aware mixture of visual projectors that dynamically maps visual features into the language space; and (2) a lightweight adaptive mechanism combining expert recommendation and pruning to mitigate cross-task knowledge interference. Evaluated on a multi-stage vision-language continual learning benchmark, the approach significantly improves instruction adherence and task generalization. Generated outputs exhibit higher fidelity to linguistic instructions, achieving superior accuracy and robustness over existing state-of-the-art methods.
📝 Abstract
Continual learning enables pre-trained generative vision-language models (VLMs) to incorporate knowledge from new tasks without retraining on data from previous ones. Recent methods update a visual projector to translate visual information for new tasks, connecting pre-trained vision encoders with large language models. However, such adjustments may cause the models to prioritize visual inputs over language instructions, particularly when learning tasks with repetitive types of textual instructions. To address this neglect of language instructions, we propose a novel framework that grounds the translation of visual information on instructions for language models. We introduce a mixture of visual projectors, each serving as a specialized visual-to-language translation expert based on the given instruction context, to adapt to new tasks. To avoid applying experts to irrelevant instruction contexts, we propose an expert recommendation strategy that reuses experts for tasks similar to those previously learned. Additionally, we introduce expert pruning to alleviate interference from experts that were cumulatively activated in previous tasks. Extensive experiments on diverse vision-language tasks demonstrate that our method outperforms existing continual learning approaches by generating instruction-following responses.
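To make the architecture concrete, here is a minimal sketch of the core idea described above: a mixture of visual projectors whose routing is conditioned on the instruction context, with a pruning hook to deactivate interfering experts. All names, dimensions, and the top-k gating scheme are illustrative assumptions; the paper's actual router, expert recommendation strategy, and pruning criteria are not specified here.

```python
import math
import random

random.seed(0)

def linear(x, W):
    # Apply weight matrix W (rows = output dims) to vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

class InstructionRoutedProjector:
    """Hypothetical mixture of visual projectors gated by an
    instruction embedding (a sketch, not the paper's implementation)."""

    def __init__(self, n_experts, vis_dim, lang_dim, instr_dim):
        rnd = lambda r, c: [[random.gauss(0.0, 0.02) for _ in range(c)]
                            for _ in range(r)]
        # Each expert maps visual features into the language space.
        self.experts = [rnd(lang_dim, vis_dim) for _ in range(n_experts)]
        # The router scores experts from the instruction context.
        self.router = rnd(n_experts, instr_dim)
        # Expert pruning removes entries from this active set.
        self.active = set(range(n_experts))

    def prune(self, expert_id):
        # Drop an expert whose accumulated activations interfere
        # with new tasks (criterion left abstract in this sketch).
        self.active.discard(expert_id)

    def forward(self, visual_feat, instr_emb, top_k=2):
        scores = linear(instr_emb, self.router)
        # Rank only non-pruned experts, keep the top-k for this instruction.
        ranked = sorted(self.active, key=lambda i: scores[i],
                        reverse=True)[:top_k]
        gate = softmax([scores[i] for i in ranked])
        # Gated sum of the selected experts' projections.
        out = [0.0] * len(self.experts[0])
        for g, i in zip(gate, ranked):
            proj = linear(visual_feat, self.experts[i])
            out = [o + g * p for o, p in zip(out, proj)]
        return out

proj = InstructionRoutedProjector(n_experts=4, vis_dim=8,
                                  lang_dim=6, instr_dim=5)
tokens = proj.forward([0.1] * 8, [0.2] * 5)
proj.prune(0)  # a pruned expert is never routed to again
```

Because routing depends on the instruction embedding rather than the visual input alone, different instruction contexts select different projector experts, which is how the framework keeps visual-to-language translation grounded in the instruction.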