🤖 AI Summary
Existing parameter-efficient fine-tuning (PEFT) methods improve downstream few-shot retrieval performance but neglect the structure of the multimodal embedding space, leaving modality-specific representations isolated and limiting cross-modal generalisation. To address this, the paper proposes SPANER, a modality-agnostic PEFT framework whose core component is a shared prompt that serves as a cross-modal conceptual anchor, drawing semantically related inputs from different modalities toward the same region of a unified semantic space. Because the anchor is modality-agnostic, additional modalities such as audio can be integrated without altering the core architecture. Across vision-language and audio-visual benchmarks, SPANER achieves competitive few-shot cross-modal retrieval performance while preserving semantic coherence in the learned embedding space, suggesting that aligning embedding structures, rather than merely tuning adapter weights, is what makes multimodal learning scale.
📝 Abstract
Recent advances in multimodal Parameter-Efficient Fine-Tuning (PEFT) have significantly improved performance on downstream tasks such as few-shot retrieval. However, most existing approaches focus on task-specific gains while neglecting the structure of the multimodal embedding space. As a result, modality-specific representations often remain isolated, limiting cross-modal generalisation. In this work, we introduce Shared Prompt AligNER (SPANER), a modality-agnostic PEFT framework designed to embed inputs from diverse modalities into a unified semantic space. At its core, SPANER employs a shared prompt mechanism that acts as a conceptual anchor, enabling semantically related instances to converge spatially regardless of modality. This shared prompt design is inherently extensible, supporting the seamless integration of additional modalities, such as audio, without altering the core architecture. Through comprehensive experiments across vision-language and audio-visual benchmarks, SPANER demonstrates competitive few-shot retrieval performance while preserving high semantic coherence in the learned embedding space. Our results highlight the importance of aligning embedding structures, rather than merely tuning adapter weights, for scalable multimodal learning.
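The shared-prompt idea can be sketched in a few lines. The snippet below is a hedged, toy illustration (not the authors' implementation): each modality's frozen encoder output is combined with a single shared prompt vector before projection into the joint space, so semantically related inputs from different modalities land near each other. The prompt values, feature dimension, and pooling scheme here are all illustrative placeholders; in SPANER the shared prompt would be learned.

```python
# Toy sketch of a shared prompt acting as a cross-modal anchor.
# NOT the SPANER implementation: the prompt here is fixed, the
# "encoder outputs" are hand-written vectors, and pooling is a
# simple mean. The point is only that mixing every modality's
# features with one shared vector pulls them together in the
# joint embedding space.
import math

DIM = 4  # illustrative feature dimension


def l2_normalize(v):
    """Scale a vector to unit length (avoids divide-by-zero)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def embed(features, shared_prompt):
    """Mean-pool the shared prompt with one modality's features,
    then normalize into the unified semantic space."""
    pooled = [(features[i] + shared_prompt[i]) / 2 for i in range(DIM)]
    return l2_normalize(pooled)


# Hypothetical "learned" shared prompt (the cross-modal anchor).
shared_prompt = [0.5, 0.5, 0.0, 0.0]

# Hand-written stand-ins for frozen encoder outputs of the same
# concept observed in two different modalities.
image_feat = [0.9, 0.1, 0.2, 0.0]
audio_feat = [0.8, 0.2, 0.1, 0.1]

img_emb = embed(image_feat, shared_prompt)
aud_emb = embed(audio_feat, shared_prompt)

# Cosine similarity of the two unit vectors: high, because both
# were anchored by the same shared prompt.
similarity = sum(a * b for a, b in zip(img_emb, aud_emb))
print(round(similarity, 3))
```

Training would adjust the shared prompt (and any light adapter weights) so that this cross-modal similarity is high for matching pairs and low otherwise; the mechanism itself is agnostic to how many modalities feed into it.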