🤖 AI Summary
Current generative vision-language models suffer from degraded instruction-following capability during continual learning, primarily because updates to the visual projector weaken alignment with linguistic instructions, especially when tasks share repetitive textual instructions. This paper proposes a continual learning framework that keeps the visual projector aligned with instructions. The method has two parts: (1) an instruction-context-aware mixture of visual projectors that dynamically maps visual features into the language space; and (2) a lightweight adaptive mechanism combining expert recommendation and pruning to mitigate cross-task knowledge interference. Evaluated on a multi-stage vision-language continual learning benchmark, the approach significantly improves instruction adherence and task generalization. Generated outputs exhibit higher fidelity to linguistic instructions, achieving superior accuracy and robustness over existing state-of-the-art methods.
📝 Abstract
Continual learning enables pre-trained generative vision-language models (VLMs) to incorporate knowledge from new tasks without retraining on data from previous ones. Recent methods update a visual projector to translate visual information for new tasks, connecting pre-trained vision encoders with large language models. However, such adjustments may cause the models to prioritize visual inputs over language instructions, particularly when learning tasks with repetitive types of textual instructions. To address this neglect of language instructions, we propose a novel framework that grounds the translation of visual information on instructions for language models. We introduce a mixture of visual projectors, each serving as a specialized visual-to-language translation expert based on the given instruction context, to adapt to new tasks. To avoid applying experts to irrelevant instruction contexts, we propose an expert recommendation strategy that reuses experts for tasks similar to those previously learned. Additionally, we introduce expert pruning to alleviate interference from experts that were cumulatively activated in previous tasks. Extensive experiments on diverse vision-language tasks demonstrate that our method outperforms existing continual learning approaches by generating instruction-following responses.
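To make the architecture concrete, here is a minimal sketch of the core idea described above: a mixture of visual projectors whose routing is conditioned on the instruction context, with a pruning hook to deactivate interfering experts. All names, dimensions, and the top-k gating scheme are illustrative assumptions; the paper's actual router, expert recommendation strategy, and pruning criteria are not specified here.

```python
import math
import random

random.seed(0)

def linear(x, W):
    # Apply weight matrix W (rows = output dims) to vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

class InstructionRoutedProjector:
    """Hypothetical mixture of visual projectors gated by an
    instruction embedding (a sketch, not the paper's implementation)."""

    def __init__(self, n_experts, vis_dim, lang_dim, instr_dim):
        rnd = lambda r, c: [[random.gauss(0.0, 0.02) for _ in range(c)]
                            for _ in range(r)]
        # Each expert maps visual features into the language space.
        self.experts = [rnd(lang_dim, vis_dim) for _ in range(n_experts)]
        # The router scores experts from the instruction context.
        self.router = rnd(n_experts, instr_dim)
        # Expert pruning removes entries from this active set.
        self.active = set(range(n_experts))

    def prune(self, expert_id):
        # Drop an expert whose accumulated activations interfere
        # with new tasks (criterion left abstract in this sketch).
        self.active.discard(expert_id)

    def forward(self, visual_feat, instr_emb, top_k=2):
        scores = linear(instr_emb, self.router)
        # Rank only non-pruned experts, keep the top-k for this instruction.
        ranked = sorted(self.active, key=lambda i: scores[i],
                        reverse=True)[:top_k]
        gate = softmax([scores[i] for i in ranked])
        # Gated sum of the selected experts' projections.
        out = [0.0] * len(self.experts[0])
        for g, i in zip(gate, ranked):
            proj = linear(visual_feat, self.experts[i])
            out = [o + g * p for o, p in zip(out, proj)]
        return out

proj = InstructionRoutedProjector(n_experts=4, vis_dim=8,
                                  lang_dim=6, instr_dim=5)
tokens = proj.forward([0.1] * 8, [0.2] * 5)
proj.prune(0)  # a pruned expert is never routed to again
```

Because routing depends on the instruction embedding rather than the visual input alone, different instruction contexts select different projector experts, which is how the framework keeps visual-to-language translation grounded in the instruction.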