🤖 AI Summary
Cross-modal alignment and multilingual capability expansion have traditionally required large-scale multimodal or multilingual pretraining. To remove this bottleneck, the paper proposes CACARA, a text-centric cross-modal alignment architecture. Its core idea is to fine-tune only the newly introduced modality encoder on English-aligned data while keeping the pretrained text encoder frozen, which yields emergent audio–text retrieval capabilities in over 100 languages. CACARA combines parameter-efficient fine-tuning with a monolingual-to-multilingual transfer mechanism, extending capabilities at low cost without degrading previously learned knowledge. In experiments, CACARA improves Recall@1 on audio-to-text retrieval by up to 14.24 percentage points, outperforming state-of-the-art multimodal models at a training cost comparable to that of a monolingual model.
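For readers unfamiliar with the metric, the following is a minimal sketch of how Recall@1 is commonly computed for retrieval, not the paper's evaluation code. It assumes a similarity matrix in which row i holds the scores of audio query i against every caption and the correct caption shares the query's index; the function name `recall_at_k` and the toy scores are hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose ground-truth item (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Hypothetical toy scores: rows are audio queries, columns are captions.
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
    [0.7, 0.1, 0.6],  # query 2 ranks caption 0 first, so it is a miss at k=1
]
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
```

A gain reported in percentage points is the absolute difference between two such recall values, each expressed as a percentage.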
📝 Abstract
As deep learning models evolve, new applications and challenges are rapidly emerging. Tasks that once relied on a single modality, such as text, images, or audio, are now enriched by seamless interactions across modalities. These connections bridge information gaps: an image can give visual form to a text, while audio can add context to an image. Researchers have developed numerous multimodal models, but most rely on resource-intensive training across multiple modalities. Extending these models to new languages often follows the same resource-heavy training strategy. In this work, we propose a multimodal and multilingual architecture, CACARA, trained through emergent alignment learning, which enables the seamless integration of new modalities into an existing bimodal or multimodal model without requiring full retraining. This work breaks new ground by demonstrating that this emergent alignment paradigm can unlock multilingual capabilities from monolingual training. By fine-tuning the newly incorporated modality only on data aligned with the English language, our model develops support for over 100 languages without explicit multilingual pretraining or tuning of the text encoder. These emergent multimodal and multilingual properties are gained efficiently, preserving previously learned knowledge at a training cost comparable to that of a monolingual model. Our strategy achieves an improvement of up to 14.24 percentage points in R@1 on audio-to-text retrieval, outperforming state-of-the-art multimodal models, all without the heavy computational cost of retraining across every modality and language.
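The abstract does not state the training objective, so the following is only an illustrative sketch of one common way to align a new modality to a frozen text encoder: a CLIP-style symmetric contrastive (InfoNCE) loss. The loss choice, the temperature value, the function names, and the toy embeddings are all assumptions made for illustration. In such a setup the text embeddings would be treated as constants, and only the audio encoder would receive gradient updates.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; audio_emb[i] is paired with text_emb[i]."""
    a = [normalize(v) for v in audio_emb]
    t = [normalize(v) for v in text_emb]
    n = len(a)
    # Cosine-similarity logits: rows index audio clips, columns index captions.
    logits = [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
              for ai in a]
    logits_t = [list(col) for col in zip(*logits)]
    audio_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    text_to_audio = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return (audio_to_text + text_to_audio) / 2


# Hypothetical toy data. The text embeddings stand in for the output of the
# frozen text encoder; the audio embeddings stand in for the trainable encoder.
frozen_text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_audio = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
misaligned_audio = [aligned_audio[1], aligned_audio[2], aligned_audio[0]]

loss_aligned = contrastive_loss(aligned_audio, frozen_text)
loss_misaligned = contrastive_loss(misaligned_audio, frozen_text)
```

The loss is lower when each audio embedding sits closest to its own caption, which is the signal an audio encoder would be trained to minimize. If the frozen text encoder already places translations of a caption near each other, audio aligned to English captions would land near their non-English counterparts as well, which is one plausible reading of the multilingual transfer the abstract describes.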