🤖 AI Summary
This work investigates how multimodal foundation models unify semantic representations across languages and across the text and speech modalities. To this end, we construct cross-lingual paraphrase sentence pairs and analyze layer-wise internal activations using representational similarity analysis (RSA) and canonical correlation analysis (CCA), complemented by length-normalized contrastive experiments. Our study yields three key findings: (1) cross-modal representations exhibit hierarchical convergence across model layers; (2) length adaptation is the critical mechanism bridging text–speech representational disparities; and (3) cross-lingual variation in speech is substantially greater than in text, with the “modality gap” vastly exceeding the “language gap” in models lacking explicit alignment. The results empirically validate the hierarchical abstraction of representation spaces, show that the benefits of current length-adaptation approaches are largely confined to high-resource languages, and provide interpretable insights for multimodal alignment and low-resource speech understanding.
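As a rough illustration of the layer-wise RSA step described above (not the authors' code), the sketch below assumes sentence-level pooled activations for N aligned paraphrase pairs, one `(N, D)` array per layer and per modality; the function names and shapes are assumptions for exposition only.

```python
# Minimal sketch of layer-wise RSA between text and speech activations.
# Assumes pooled, sentence-level activations aligned row-by-row across modalities.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(acts: np.ndarray) -> np.ndarray:
    """Representational dissimilarity matrix for one layer:
    pairwise cosine distances between (N, D) sentence activations,
    returned in condensed (vectorized) form."""
    return pdist(acts, metric="cosine")

def layerwise_rsa(text_acts: list[np.ndarray],
                  speech_acts: list[np.ndarray]) -> list[float]:
    """Spearman correlation between text and speech RDMs at each layer.
    Each list holds one (N, D) array per layer; higher values indicate
    more similar relational structure across the two modalities."""
    scores = []
    for t, s in zip(text_acts, speech_acts):
        rho, _ = spearmanr(rdm(t), rdm(s))
        scores.append(rho)
    return scores
```

A layer-by-layer plot of these scores is one way to visualize the hierarchical convergence reported above.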
📝 Abstract
Multimodal foundation models aim to create a unified representation space that abstracts away from surface features such as language syntax or modality differences. To investigate this, we study the internal representations of three recent models, analyzing activations elicited by semantically equivalent sentences across languages in the text and speech modalities. Our findings reveal that: 1) Cross-modal representations converge over model layers, except in the initial layers specialized in text and speech processing. 2) Length adaptation is crucial for reducing the cross-modal gap between text and speech, although the effectiveness of current approaches is primarily limited to high-resource languages. 3) Speech exhibits larger cross-lingual differences than text. 4) For models not explicitly trained for modality-agnostic representations, the modality gap is more prominent than the language gap.
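To make findings 3) and 4) concrete, one simple way to contrast a "modality gap" with a "language gap" is to compare mean cosine distances between pooled activations of the same sentences across modalities versus across languages. The sketch below is a hypothetical illustration, not the paper's evaluation code; the language codes and dictionary keys are placeholders.

```python
# Illustrative comparison of modality gap vs. language gap from pooled activations.
# Inputs: a dict mapping (language, modality) -> (N, D) array, rows aligned by sentence.
import numpy as np

def rowwise_cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two (N, D) activation matrices."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

def gaps(acts: dict[tuple[str, str], np.ndarray]) -> tuple[float, float]:
    """Modality gap: same language, text vs. speech for the same sentences.
    Language gap: same modality, paraphrase pairs across two languages.
    Both reported as mean cosine distance (1 - similarity)."""
    modality_gap = 1.0 - rowwise_cosine(acts[("en", "text")], acts[("en", "speech")]).mean()
    language_gap = 1.0 - rowwise_cosine(acts[("en", "text")], acts[("de", "text")]).mean()
    return modality_gap, language_gap
```

Under this kind of measurement, finding 4) corresponds to the modality gap exceeding the language gap for models without explicit modality-agnostic training.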