🤖 AI Summary
To address insufficient cross-modal and cross-task interaction modeling in continual vision-language retrieval, this paper proposes a low-rank prompt interaction mechanism for efficient and robust multi-stage joint learning. The approach introduces three key components: (1) a low-rank interaction-augmented decomposition that disentangles intra- and inter-modal dynamic dependencies by sharing and separating common and task-specific low-rank factors; (2) hierarchical low-rank contrastive learning, which stabilizes training against the multi-modal semantic differences introduced by low-rank initialization; and (3) explicit task-level contrastive constraints based on task semantic distances, which strengthen cross-task knowledge transfer during prompt learning. Evaluated on two continual retrieval benchmarks, the method outperforms state-of-the-art continual learning and multimodal baselines while adding only a minimal number of parameters. The source code is publicly available.
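The common-specific low-rank factorization described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, shapes, rank, and the choice of one shared factor plus per-task factors are all assumptions made for clarity.

```python
import torch

# Hypothetical sketch: each prompt of shape (prompt_len, d_model) is composed
# on the fly from a low-rank factorization, so trainable parameters grow with
# the rank r rather than with prompt_len * d_model per task.
class LowRankPrompt(torch.nn.Module):
    def __init__(self, prompt_len=8, d_model=512, rank=4, num_tasks=3):
        super().__init__()
        # Common factor shared across all tasks (carries cross-task knowledge)
        self.common = torch.nn.Parameter(torch.randn(prompt_len, rank) * 0.02)
        # Task-specific factors, one (rank, d_model) matrix per task
        self.specific = torch.nn.Parameter(
            torch.randn(num_tasks, rank, d_model) * 0.02
        )

    def forward(self, task_id: int) -> torch.Tensor:
        # (prompt_len, rank) @ (rank, d_model) -> (prompt_len, d_model)
        return self.common @ self.specific[task_id]
```

Because the shared factor is reused by every task while each task keeps its own small factor, related tasks can exchange knowledge through the common component without the parameter count exploding as tasks accumulate.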
📝 Abstract
Research on continual learning for multi-modal tasks has received increasing attention. However, most existing work overlooks explicit cross-modal and cross-task interactions. In this paper, we propose Low-rank Prompt Interaction (LPI) to address this general problem of multi-modal understanding, considering both cross-modal and cross-task interactions. For the former, we attach multi-modal correlation modules to the corresponding Transformer layers. Because the number of trainable parameters scales with the number of layers and tasks, we propose a low-rank interaction-augmented decomposition that avoids memory explosion while enhancing cross-modal association by sharing and separating common and task-specific low-rank factors. In addition, because low-rank initialization carries multi-modal semantic differences, we adopt hierarchical low-rank contrastive learning to ensure training robustness. For the latter, we first conduct a visual analysis and observe that different tasks are clearly separated in semantic proximity. We therefore introduce explicit task contrastive constraints into the prompt learning process based on task semantic distances. Experiments on two retrieval tasks show performance improvements with only a minimal number of additional parameters, demonstrating the effectiveness of our method. Code is available at https://github.com/Kelvin-ywc/LPI.
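The task contrastive constraint mentioned above could take the form of a margin-based pairwise loss over per-task prompt embeddings, weighted by precomputed task semantic distances. The sketch below is an assumption-laden illustration of that idea, not the paper's loss: the function name, the distance threshold, and the margin are all hypothetical.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a task-level contrastive constraint: prompts of
# semantically close tasks are pulled together, while prompts of distant
# tasks are pushed apart up to a margin. `task_dist` is an assumed
# precomputed (T, T) matrix of task semantic distances.
def task_contrastive_loss(prompts, task_dist, threshold=0.5, margin=1.0):
    # prompts: (T, d) pooled prompt embeddings, one row per task
    num_tasks = prompts.size(0)
    loss = prompts.new_zeros(())
    pairs = 0
    for i in range(num_tasks):
        for j in range(i + 1, num_tasks):
            d = torch.norm(prompts[i] - prompts[j])
            if task_dist[i, j] < threshold:
                # Semantically close tasks: attract their prompts
                loss = loss + d.pow(2)
            else:
                # Semantically distant tasks: repel up to the margin
                loss = loss + F.relu(margin - d).pow(2)
            pairs += 1
    return loss / max(pairs, 1)
```

Treating task distance as the supervision signal here mirrors the abstract's observation that tasks differ clearly in semantic proximity, so the constraint encourages transfer between related tasks while keeping unrelated tasks from interfering.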