🤖 AI Summary
To address the challenge of cross-modal generalization under scarce multimodal paired data, this paper proposes a continual learning paradigm that leverages an intermediary modality as a bridge to progressively map novel modalities onto a dynamically evolving shared discrete codebook. Methodologically, it integrates discrete representation learning, cross-stage semantic alignment, and adaptive codebook evolution. Key contributions include: (i) a Continual Mixture-of-Experts Adapter (CMoE-Adapter), enabling efficient, modality-specific parameter expansion; and (ii) a Pseudo-Modality Replay (PMR) mechanism that preserves historical semantic priors during incremental codebook updates. Extensive experiments across diverse cross-modal retrieval tasks, including image-text, audio-text, video-text, and speech-text, show that the proposed approach generalizes well with limited paired supervision.
📝 Abstract
Cross-modal generalization aims to learn a shared discrete representation space from multimodal pairs, enabling knowledge transfer across unannotated modalities. However, achieving a unified representation for all modality pairs requires extensive paired data, which is often impractical to collect. Inspired by the availability of abundant bimodal data (e.g., as leveraged in ImageBind), we explore a continual learning approach that incrementally maps new modalities into a shared discrete codebook via a mediator modality. We propose the Continual Mixture-of-Experts Adapter (CMoE-Adapter) to project diverse modalities into a unified space while preserving prior knowledge. To align semantics across training stages, we introduce a Pseudo-Modality Replay (PMR) mechanism with a dynamically expanding codebook, enabling the model to adaptively incorporate new modalities using learned ones as guidance. Extensive experiments on image-text, audio-text, video-text, and speech-text pairs show that our method achieves strong performance on various cross-modal generalization tasks. Code is provided in the supplementary material.
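To make the core idea concrete, the following is a minimal, hypothetical sketch of what a Mixture-of-Experts adapter feeding a shared discrete codebook might look like. All names, dimensions, and the single-linear-layer experts are illustrative assumptions, not the paper's actual architecture: a gating network weights several expert projections, and the adapted feature is snapped to its nearest codebook entry (vector quantization).

```python
import numpy as np

rng = np.random.default_rng(0)

D, K, E = 16, 32, 4  # feature dim, codebook size, number of experts (illustrative)

codebook = rng.normal(size=(K, D))           # shared discrete codebook
experts  = rng.normal(size=(E, D, D)) * 0.1  # one linear adapter per expert (assumed form)
gate_w   = rng.normal(size=(D, E)) * 0.1     # gating network parameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cmoe_adapt(x):
    """Route a modality feature through a weighted mixture of expert adapters,
    then quantize the result to the nearest codebook entry."""
    gates = softmax(x @ gate_w)                              # (E,) expert weights
    adapted = sum(g * (x @ W) for g, W in zip(gates, experts))
    idx = int(np.argmin(((codebook - adapted) ** 2).sum(axis=1)))
    return idx, codebook[idx]

x = rng.normal(size=D)   # e.g., an embedding from a frozen unimodal encoder
idx, code = cmoe_adapt(x)
```

In a continual setting, adding a new modality would correspond to appending experts (and, with PMR, new codebook rows) while keeping earlier experts frozen, so previously learned mappings are preserved.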