🤖 AI Summary
This work investigates how multimodal foundation models unify semantic representations across languages and across the text and speech modalities. To this end, we construct cross-lingual paraphrase sentence pairs and analyze layer-wise internal activations using representational similarity analysis (RSA) and canonical correlation analysis (CCA), complemented by length-normalized contrastive experiments. Our study yields three key findings: (1) cross-modal representations exhibit hierarchical convergence across model layers; (2) length adaptation is the critical mechanism bridging text–speech representational disparities; and (3) cross-lingual variation in speech is substantially greater than in text, with the “modality gap” vastly exceeding the “language gap” in models lacking explicit alignment. The results empirically validate the hierarchical abstraction of representation spaces, show that the benefits of current length-adaptation approaches are largely confined to high-resource languages, and provide interpretable insights for multimodal alignment and low-resource speech understanding.
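As a rough illustration of the layer-wise RSA step described above (not the authors' code), the sketch below assumes sentence-level pooled activations for N aligned paraphrase pairs, one `(N, D)` array per layer and per modality; the function names and shapes are assumptions for exposition only.

```python
# Minimal sketch of layer-wise RSA between text and speech activations.
# Assumes pooled, sentence-level activations aligned row-by-row across modalities.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(acts: np.ndarray) -> np.ndarray:
    """Representational dissimilarity matrix for one layer:
    pairwise cosine distances between (N, D) sentence activations,
    returned in condensed (vectorized) form."""
    return pdist(acts, metric="cosine")

def layerwise_rsa(text_acts: list[np.ndarray],
                  speech_acts: list[np.ndarray]) -> list[float]:
    """Spearman correlation between text and speech RDMs at each layer.
    Each list holds one (N, D) array per layer; higher values indicate
    more similar relational structure across the two modalities."""
    scores = []
    for t, s in zip(text_acts, speech_acts):
        rho, _ = spearmanr(rdm(t), rdm(s))
        scores.append(rho)
    return scores
```

A layer-by-layer plot of these scores is one way to visualize the hierarchical convergence reported above.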
📝 Abstract
Multimodal foundation models aim to create a unified representation space that abstracts away from surface features such as language syntax or modality differences. To investigate this, we study the internal representations of three recent models, analyzing activations elicited by semantically equivalent sentences across languages in the text and speech modalities. Our findings reveal that: 1) Cross-modal representations converge over model layers, except in the initial layers specialized in text and speech processing. 2) Length adaptation is crucial for reducing the cross-modal gap between text and speech, although the effectiveness of current approaches is primarily limited to high-resource languages. 3) Speech exhibits larger cross-lingual differences than text. 4) For models not explicitly trained for modality-agnostic representations, the modality gap is more prominent than the language gap.
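To make findings 3) and 4) concrete, one simple way to contrast a "modality gap" with a "language gap" is to compare mean cosine distances between pooled activations of the same sentences across modalities versus across languages. The sketch below is a hypothetical illustration, not the paper's evaluation code; the language codes and dictionary keys are placeholders.

```python
# Illustrative comparison of modality gap vs. language gap from pooled activations.
# Inputs: a dict mapping (language, modality) -> (N, D) array, rows aligned by sentence.
import numpy as np

def rowwise_cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two (N, D) activation matrices."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

def gaps(acts: dict[tuple[str, str], np.ndarray]) -> tuple[float, float]:
    """Modality gap: same language, text vs. speech for the same sentences.
    Language gap: same modality, paraphrase pairs across two languages.
    Both reported as mean cosine distance (1 - similarity)."""
    modality_gap = 1.0 - rowwise_cosine(acts[("en", "text")], acts[("en", "speech")]).mean()
    language_gap = 1.0 - rowwise_cosine(acts[("en", "text")], acts[("de", "text")]).mean()
    return modality_gap, language_gap
```

Under this kind of measurement, finding 4) corresponds to the modality gap exceeding the language gap for models without explicit modality-agnostic training.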