Cross-lingual Matryoshka Representation Learning across Speech and Text

📅 2026-02-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the dual challenge faced by speakers of low-resource languages such as Wolof: most online knowledge is available only in a few high-resource written languages, while their own language is predominantly oral. To bridge this gap, the authors train the first French–Wolof bilingual speech–text Matryoshka embedding model, introducing Matryoshka representation learning to cross-lingual speech–text retrieval and enabling direct retrieval of French text from Wolof speech without a costly automatic speech recognition (ASR) and machine translation pipeline. By fusing modalities while keeping the pretrained text encoder frozen, the model achieves strong cross-modal retrieval performance and generalizes effectively to downstream tasks such as spoken intent classification. The study further shows that semantic information concentrates in a small subset of embedding dimensions, pointing to opportunities for more efficient inference and deployment.
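The core idea behind Matryoshka embeddings is that a prefix of the full embedding vector is itself a usable, lower-dimensional embedding, so retrieval cost can be traded against accuracy by truncating dimensions at inference time. The sketch below is a minimal illustration of that truncated-retrieval mechanism, not the paper's code; the data is synthetic and all function names are hypothetical.

```python
import numpy as np

def truncate_and_normalize(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka components and re-normalize to unit length."""
    sub = emb[:, :dim]
    return sub / np.linalg.norm(sub, axis=1, keepdims=True)

def retrieve(query_emb: np.ndarray, doc_embs: np.ndarray, dim: int) -> np.ndarray:
    """Rank documents by cosine similarity computed in the truncated space."""
    q = truncate_and_normalize(query_emb, dim)
    d = truncate_and_normalize(doc_embs, dim)
    scores = q @ d.T                       # cosine similarity (unit vectors)
    return np.argsort(-scores, axis=1)     # best match first

# Toy example: 4 "documents" with full dimension 64; query is a slightly
# perturbed copy of document 2, so it should be retrieved first.
rng = np.random.default_rng(0)
docs = rng.standard_normal((4, 64))
query = docs[2:3] + 0.01 * rng.standard_normal((1, 64))

assert retrieve(query, docs, dim=64)[0, 0] == 2
assert retrieve(query, docs, dim=8)[0, 0] == 2  # a small prefix already suffices
```

If, as the paper reports, semantic information concentrates in a few leading components, retrieval with a small `dim` preserves most accuracy while shrinking index size and similarity-computation cost proportionally.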

📝 Abstract
Speakers of under-represented languages face both a language barrier, as most online knowledge is in a few dominant languages, and a modality barrier, since information is largely text-based while many languages are primarily oral. We address this for French-Wolof by training the first bilingual speech-text Matryoshka embedding model, enabling efficient retrieval of French text from Wolof speech queries without relying on costly ASR-translation pipelines. We introduce large-scale data curation pipelines and new benchmarks, compare modeling strategies, and show that modality fusion within a frozen text Matryoshka model performs best. Although trained only for retrieval, the model generalizes well to other tasks, such as speech intent detection, indicating the learning of general semantic representations. Finally, we analyze cost-accuracy trade-offs across Matryoshka dimensions and ranks, showing that information is concentrated in only a few components, suggesting potential for efficiency improvements.
Problem

Research questions and friction points this paper is trying to address.

cross-lingual
speech-text retrieval
under-represented languages
modality barrier
language barrier
Innovation

Methods, ideas, or system contributions that make the work stand out.

Matryoshka representation
cross-lingual speech-text retrieval
modality fusion
low-resource languages
embedding efficiency