🤖 AI Summary
To address the challenge of cross-modal generalization under scarce multimodal paired data, this paper proposes a continual learning paradigm that leverages an intermediary modality as a bridge to progressively map novel modalities onto a dynamically evolving shared discrete codebook. Methodologically, it integrates discrete representation learning, cross-stage semantic alignment, and adaptive codebook evolution. Key contributions include: (i) a Continual Mixture-of-Experts Adapter (CMoE-Adapter), enabling efficient, modality-specific parameter expansion; and (ii) a Pseudo-Modality Replay (PMR) mechanism that preserves historical semantic priors during incremental codebook updates. Extensive experiments across diverse cross-modal retrieval tasks, including image-text, audio-text, video-text, and speech-text, show that the proposed approach generalizes well with limited paired supervision.
📝 Abstract
Cross-modal generalization aims to learn a shared discrete representation space from multimodal pairs, enabling knowledge transfer across unannotated modalities. However, achieving a unified representation for all modality pairs requires extensive paired data, which is often impractical to collect. Inspired by the availability of abundant bimodal data (e.g., as leveraged in ImageBind), we explore a continual learning approach that incrementally maps new modalities into a shared discrete codebook via a mediator modality. We propose the Continual Mixture-of-Experts Adapter (CMoE-Adapter) to project diverse modalities into a unified space while preserving prior knowledge. To align semantics across training stages, we introduce a Pseudo-Modality Replay (PMR) mechanism with a dynamically expanding codebook, enabling the model to adaptively incorporate new modalities using learned ones as guidance. Extensive experiments on image-text, audio-text, video-text, and speech-text pairs show that our method achieves strong performance on various cross-modal generalization tasks. Code is provided in the supplementary material.
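To make the core idea concrete, the following is a minimal, hypothetical sketch of what a Mixture-of-Experts adapter feeding a shared discrete codebook might look like. All names, dimensions, and the single-linear-layer experts are illustrative assumptions, not the paper's actual architecture: a gating network weights several expert projections, and the adapted feature is snapped to its nearest codebook entry (vector quantization).

```python
import numpy as np

rng = np.random.default_rng(0)

D, K, E = 16, 32, 4  # feature dim, codebook size, number of experts (illustrative)

codebook = rng.normal(size=(K, D))           # shared discrete codebook
experts  = rng.normal(size=(E, D, D)) * 0.1  # one linear adapter per expert (assumed form)
gate_w   = rng.normal(size=(D, E)) * 0.1     # gating network parameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cmoe_adapt(x):
    """Route a modality feature through a weighted mixture of expert adapters,
    then quantize the result to the nearest codebook entry."""
    gates = softmax(x @ gate_w)                              # (E,) expert weights
    adapted = sum(g * (x @ W) for g, W in zip(gates, experts))
    idx = int(np.argmin(((codebook - adapted) ** 2).sum(axis=1)))
    return idx, codebook[idx]

x = rng.normal(size=D)   # e.g., an embedding from a frozen unimodal encoder
idx, code = cmoe_adapt(x)
```

In a continual setting, adding a new modality would correspond to appending experts (and, with PMR, new codebook rows) while keeping earlier experts frozen, so previously learned mappings are preserved.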