CACARA: Cross-Modal Alignment Leveraging a Text-Centric Approach for Cost-Effective Multimodal and Multilingual Learning

📅 2025-11-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the bottleneck where cross-modal alignment and multilingual capability expansion traditionally require large-scale multimodal/multilingual pretraining, this paper proposes CACARA—a text-centric cross-modal alignment architecture. Its core innovation lies in enabling emergent audio–text retrieval capabilities across 100 languages by fine-tuning only the newly introduced modality encoder on English-aligned data, while keeping the pretrained text encoder frozen. CACARA integrates parameter-efficient fine-tuning with a monolingual-to-multilingual transfer mechanism, achieving low-cost capability extension without compromising original knowledge. Extensive experiments demonstrate that CACARA achieves up to a 14.24-percentage-point improvement in Recall@1 on audio–text retrieval tasks, outperforming state-of-the-art multimodal models, while maintaining training costs comparable to monolingual baselines.

📝 Abstract
As deep learning models evolve, new applications and challenges are rapidly emerging. Tasks that once relied on a single modality, such as text, images, or audio, are now enriched by seamless interactions between multimodal data. These connections bridge information gaps: an image can visually ground a text, while audio can add context to an image. Researchers have developed numerous multimodal models, but most rely on resource-intensive training across multiple modalities. Similarly, extending these models to new languages often follows the same resource-heavy training strategy. In this work, we propose a multimodal and multilingual architecture, CACARA, trained through emergent alignment learning, enabling the seamless integration of new modalities into an existing bimodal/multimodal model without requiring full retraining. This work breaks new ground by demonstrating that this emergent alignment paradigm can unlock multilingual capabilities from monolingual training. By fine-tuning the newly incorporated modality only on data aligned with the English language, our model develops support for over 100 languages without explicit multilingual pretraining or tuning of the text encoder. Such emergent multimodal and multilingual properties are gained efficiently, preserving previously learned knowledge at a training cost comparable to that of a monolingual model. Our strategy achieves up to a 14.24-percentage-point improvement in R@1 audio-to-text retrieval, outperforming state-of-the-art multimodal models, all without the heavy computational cost of retraining across every modality and language.
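The core recipe described above, training only a new-modality encoder against a frozen text encoder on English-paired data, is typically realized with a symmetric contrastive objective. The sketch below is a minimal numpy illustration of that idea, not the authors' implementation: the function name, the InfoNCE form, and the temperature value are all assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere (cosine-similarity space)."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def symmetric_infonce(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss between a batch of
    audio embeddings (from the trainable new-modality encoder) and text
    embeddings (from the frozen text encoder). Matching audio-text pairs
    share the same row index; temperature=0.07 is an assumed default."""
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = a @ t.T / temperature          # pairwise cosine similarities
    labels = np.arange(len(a))              # the diagonal holds true pairs

    def xent(lg):
        # Numerically stable cross-entropy with the diagonal as target.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average both retrieval directions: audio->text and text->audio.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Because only the audio side would receive gradients in training, the text embedding space is left intact, which is what lets the multilingual structure already present in the frozen text encoder transfer "for free" to the new modality.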
Problem

Research questions and friction points this paper is trying to address.

Integrating new modalities traditionally requires full, resource-intensive model retraining
Multilingual support typically demands explicit multilingual training data
Reducing computational cost without sacrificing retrieval performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal alignment via text-centric emergent learning
Adds new modalities without full model retraining
Enables multilingual support from monolingual training only
Diego A. B. Moreira
Instituto de Computação, Universidade Estadual de Campinas (UNICAMP), Brasil
Alef I. Ferreira
Instituto de Informática, Universidade Federal de Goiás (UFG), Goiás, Brasil
Jhessica Silva
Instituto de Computação, Universidade Estadual de Campinas
Ethics in Artificial Intelligence, Natural Language Processing
Gabriel O. dos Santos
Instituto de Computação, Universidade Estadual de Campinas (UNICAMP), Brasil
Gustavo Bonil
Universidade Estadual de Campinas
João Gondim
Instituto de Computação, Universidade Estadual de Campinas (UNICAMP), Brasil
Marina dos Santos
Instituto de Estudos da Linguagem, Universidade Estadual de Campinas (UNICAMP), Brasil
Helena Maia
University of Campinas
Computer Vision, Machine Learning, Image Processing
Simone Hashiguti
Instituto de Estudos da Linguagem, Universidade Estadual de Campinas (UNICAMP), Brasil
Nádia da Silva
Instituto de Informática, Universidade Federal de Goiás (UFG), Goiás, Brasil
Carolina Scarton
Senior Lecturer in Natural Language Processing, NLP group / GATE group, University of Sheffield
Social Media Analysis, Text Simplification, Machine Translation, Natural Language Processing, Artificial Intelligence
Helio Pedrini
Instituto de Computação, Universidade Estadual de Campinas (UNICAMP), Brasil
Sandra Avila
Professor of Computer Science, University of Campinas (Unicamp)
Machine Learning, Deep Learning, Computer Vision, Natural Language Processing