ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a failure mode in multimodal continual instruction tuning: when a multimodal large language model routes tasks to experts using image-text similarity alone, tasks with substantially different output structures can be assigned to the same expert, which causes gradient interference and ineffective expert collaboration. To resolve this, the authors propose ProtoAda, a prototype-guided adaptive tuning framework whose format-aware task prototypes encode both task semantics and output structure, so that task assignment and routing respect both. A geometry-aware consolidation mechanism then merges format-compatible updates, reusing and progressively refining existing parameters within a LoRA-based sparse architecture. Experiments on multiple multimodal continual learning benchmarks show that ProtoAda outperforms existing methods, especially on tasks that are sensitive to output structure.

📝 Abstract

Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually acquire new vision-language capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. To reduce inter-task interference and promote collaboration, recent methods often employ sparse architectures such as Mixture of LoRA Experts with image-text similarity routing. However, tasks with distinct response structures can share highly similar visual-linguistic semantics and thus be wrongly routed to the same expert, so image-text similarity alone is insufficient for reliable task assignment. For example, an expert trained on a grounding task that requires coordinate prediction may become biased toward producing short textual answers after learning semantically similar VQA tasks. This format-blind task assignment merges heterogeneous response types into shared parameters, inducing gradient interference and ineffective expert collaboration. To address this problem, we propose ProtoAda, a prototype-guided adaptive tuning framework. ProtoAda introduces format-aware task prototypes to align task assignment and routing with both task semantics and output structure, and further consolidates format-compatible updates in a geometry-aware manner to effectively reuse and progressively refine existing parameters. Extensive experiments on multiple benchmarks demonstrate that ProtoAda achieves superior performance, especially on tasks whose answer structures are easily corrupted by sequential tuning.
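The routing failure described in the abstract can be illustrated with a toy example. The sketch below is not ProtoAda's actual method: the abstract does not specify how prototypes are built, so the concatenation of a semantic embedding with an output-format descriptor, the cosine threshold, and every name here (`make_prototype`, `PrototypeRouter`, `threshold`, the hand-picked vectors) are assumptions chosen only to show why adding format information can separate a VQA task from a grounding task that looks semantically almost identical.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit L2 norm."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


def make_prototype(semantic, fmt):
    """Hypothetical format-aware prototype: unit-normalised concatenation
    of a semantic embedding and an output-format descriptor."""
    return unit(np.concatenate([unit(semantic), unit(fmt)]))


class PrototypeRouter:
    """Toy router: reuse the closest expert if its prototype is similar
    enough, otherwise allocate a new expert (adapter expansion)."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed cosine-similarity cut-off
        self.prototypes = []        # one stored prototype per expert

    def assign(self, semantic, fmt):
        query = make_prototype(semantic, fmt)
        if self.prototypes:
            sims = [float(query @ p) for p in self.prototypes]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                return best, False  # reuse an existing expert
        # No compatible expert: store the prototype and expand.
        # (Simplification: prototypes are never updated after creation.)
        self.prototypes.append(query)
        return len(self.prototypes) - 1, True


# Hand-picked toy vectors (assumptions, not real embeddings).
SHORT_TEXT, COORDINATES = [1.0, 0.0], [0.0, 1.0]
vqa_sem, grounding_sem, vqa2_sem = [1.0, 0.1], [0.95, 0.15], [0.9, 0.2]

# Semantics alone barely distinguish VQA from grounding.
semantic_only_sim = float(unit(vqa_sem) @ unit(grounding_sem))

router = PrototypeRouter(threshold=0.8)
vqa_route = router.assign(vqa_sem, SHORT_TEXT)
grounding_route = router.assign(grounding_sem, COORDINATES)
vqa2_route = router.assign(vqa2_sem, SHORT_TEXT)

print(f"semantic-only cosine: {semantic_only_sim:.3f}")
print("routes (expert, expanded):", vqa_route, grounding_route, vqa2_route)
```

In this toy setup a similarity-only router would place the grounding task on the VQA expert, whereas the format-aware prototype sends it to a new expert and lets the second VQA task reuse the first one. The geometry-aware consolidation step is left out because the abstract gives no detail on how it works.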

Problem

Research questions and friction points this paper is trying to address.

Multimodal Continual Instruction Tuning

task routing

response structure

gradient interference

format-aware assignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

prototype-guided routing

format-aware task assignment

adaptive adapter expansion