Optimal Transport Regularization for Speech Text Alignment in Spoken Language Models

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Spoken Language Models (SLMs) suffer from a substantial modality gap between speech and text representations and exhibit weak cross-dataset generalization. To address this, we propose Optimal Transport Regularization (OTReg), which formulates speech–text embedding alignment as an optimal transport problem. OTReg introduces a regularization loss that establishes structured cross-modal correspondences in the embedding space, without requiring additional annotations or learnable parameters. It is lightweight and architecture-agnostic, integrating seamlessly into existing SLM training. Evaluated on multilingual ASR benchmarks, OTReg improves speech–text alignment, reducing average word error rate by 2.1%, and enhances out-of-domain generalization by up to 14.3%, demonstrating its effectiveness in mitigating the speech–text modality gap.

📝 Abstract
Spoken Language Models (SLMs), which extend Large Language Models (LLMs) to perceive speech inputs, have gained increasing attention for their potential to advance speech understanding tasks. However, despite recent progress, studies show that SLMs often struggle to generalize across datasets, even for trained languages and tasks, raising concerns about whether they process speech in a text-like manner as intended. A key challenge underlying this limitation is the modality gap between speech and text representations. The high variability in speech embeddings may allow SLMs to achieve strong in-domain performance by exploiting unintended speech variations, ultimately hindering generalization. To mitigate this modality gap, we introduce Optimal Transport Regularization (OTReg), a method that formulates speech-text alignment as an optimal transport problem and derives a regularization loss to improve SLM training. In each training iteration, OTReg first establishes a structured correspondence between speech and transcript embeddings by determining the optimal transport plan, then incorporates the regularization loss based on this transport plan to optimize SLMs in generating speech embeddings that align more effectively with transcript embeddings. OTReg is lightweight, requiring no additional labels or learnable parameters, and integrates seamlessly into existing SLM training procedures. Extensive multilingual ASR experiments demonstrate that OTReg enhances speech-text alignment, mitigates the modality gap, and consequently improves SLM generalization across diverse datasets.
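The abstract describes a two-step procedure per training iteration: compute an optimal transport plan between speech and transcript embeddings, then derive a regularization loss from that plan. Below is a minimal, hedged sketch of how such a loss could be computed, using entropic (Sinkhorn) optimal transport with uniform marginals and a cosine-distance cost; the function names, cost choice, and hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sinkhorn_plan(cost, epsilon=0.1, n_iters=200):
    """Entropic-regularized OT plan via Sinkhorn iterations (uniform marginals).
    Assumption: the paper's exact solver and marginals may differ."""
    n, m = cost.shape
    K = np.exp(-cost / epsilon)          # Gibbs kernel
    a = np.full(n, 1.0 / n)              # uniform mass over speech frames
    b = np.full(m, 1.0 / m)              # uniform mass over text tokens
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # alternate marginal projections
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan P (n x m)

def ot_alignment_loss(speech_emb, text_emb, epsilon=0.1):
    """OT regularization loss <P, C>: transport cost under the optimal plan.
    Cosine distance is an illustrative cost choice, not necessarily the paper's."""
    S = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    T = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    cost = 1.0 - S @ T.T                 # pairwise cosine distance
    P = sinkhorn_plan(cost, epsilon)
    return (P * cost).sum()
```

In an SLM training loop, a loss of this form would be added (with some weight) to the standard ASR objective, pulling each speech frame's embedding toward the transcript tokens the transport plan pairs it with, with no extra labels or learnable parameters.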
Problem

Research questions and friction points this paper is trying to address.

SLMs struggle to generalize across datasets due to speech-text modality gap
High variability in speech embeddings hinders SLM generalization performance
How to align speech and text embeddings without additional labels or learnable parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimal Transport Regularization for speech-text alignment
Lightweight method with no extra parameters
Improves SLM generalization across datasets
Wenze Xu
Mashang Consumer Finance Co., Ltd., Chongqing, China and The University of Sydney, Sydney, Australia
Chun Wang
Mashang Consumer Finance Co., Ltd., Chongqing, China
Jiazhen Yu
Macau University of Science and Technology, Macau SAR, China
Sheng Chen
Mashang Consumer Finance Co., Ltd., Chongqing, China
Liang Gao
Associate Professor, Bioengineering, UCLA
Biomedical optics, Ultrafast optical imaging, Computational optical imaging
Weihong Deng
Professor, Beijing University of Posts and Telecommunications
Multimodal learning, Trustworthy AI, Affective computing, Biometrics