Cross-lingual Matryoshka Representation Learning across Speech and Text

📅 2026-02-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the dual challenge faced by speakers of low-resource languages such as Wolof: most online knowledge is available only in a few high-resource written languages, while their own language is predominantly oral. To bridge this gap, the authors train the first French–Wolof bilingual speech–text Matryoshka embedding model, introducing Matryoshka representation learning to cross-lingual speech–text retrieval and enabling direct retrieval of French text from Wolof speech without a costly automatic speech recognition (ASR) and machine translation pipeline. By fusing modalities while keeping the pretrained text encoder frozen, the model achieves strong cross-modal retrieval performance and generalizes effectively to downstream tasks such as spoken intent classification. The study further shows that semantic information concentrates in a small subset of embedding dimensions, pointing to opportunities for more efficient inference and deployment.
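The core idea behind Matryoshka embeddings is that a prefix of the full embedding vector is itself a usable, lower-dimensional embedding, so retrieval cost can be traded against accuracy by truncating dimensions at inference time. The sketch below is a minimal illustration of that truncated-retrieval mechanism, not the paper's code; the data is synthetic and all function names are hypothetical.

```python
import numpy as np

def truncate_and_normalize(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka components and re-normalize to unit length."""
    sub = emb[:, :dim]
    return sub / np.linalg.norm(sub, axis=1, keepdims=True)

def retrieve(query_emb: np.ndarray, doc_embs: np.ndarray, dim: int) -> np.ndarray:
    """Rank documents by cosine similarity computed in the truncated space."""
    q = truncate_and_normalize(query_emb, dim)
    d = truncate_and_normalize(doc_embs, dim)
    scores = q @ d.T                       # cosine similarity (unit vectors)
    return np.argsort(-scores, axis=1)     # best match first

# Toy example: 4 "documents" with full dimension 64; query is a slightly
# perturbed copy of document 2, so it should be retrieved first.
rng = np.random.default_rng(0)
docs = rng.standard_normal((4, 64))
query = docs[2:3] + 0.01 * rng.standard_normal((1, 64))

assert retrieve(query, docs, dim=64)[0, 0] == 2
assert retrieve(query, docs, dim=8)[0, 0] == 2  # a small prefix already suffices
```

If, as the paper reports, semantic information concentrates in a few leading components, retrieval with a small `dim` preserves most accuracy while shrinking index size and similarity-computation cost proportionally.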

📝 Abstract
Speakers of under-represented languages face both a language barrier, as most online knowledge is in a few dominant languages, and a modality barrier, since information is largely text-based while many languages are primarily oral. We address this for French-Wolof by training the first bilingual speech-text Matryoshka embedding model, enabling efficient retrieval of French text from Wolof speech queries without relying on costly ASR-translation pipelines. We introduce large-scale data curation pipelines and new benchmarks, compare modeling strategies, and show that modality fusion within a frozen text Matryoshka model performs best. Although trained only for retrieval, the model generalizes well to other tasks, such as speech intent detection, indicating the learning of general semantic representations. Finally, we analyze cost-accuracy trade-offs across Matryoshka dimensions and ranks, showing that information is concentrated in only a few components, suggesting potential for efficiency improvements.
Problem

Research questions and friction points this paper is trying to address.

cross-lingual
speech-text retrieval
under-represented languages
modality barrier
language barrier
Innovation

Methods, ideas, or system contributions that make the work stand out.

Matryoshka representation
cross-lingual speech-text retrieval
modality fusion
low-resource languages
embedding efficiency