Preservation of Language Understanding Capabilities in Speech-aware Large Language Models

📅 2025-09-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether large language models' linguistic understanding degrades when they are accessed through speech input. To address this, we introduce C3T, the first dedicated benchmark integrating text-understanding tasks, controllable voice cloning, and a cross-modal evaluation framework, to systematically quantify performance degradation and speaker invariance under speech input. C3T decouples phonetic attributes (e.g., timbre, speaking rate, accent) from semantic content, enabling fine-grained assessment of cross-modal robustness and fairness. Experiments reveal substantial comprehension deterioration across mainstream speech-language models, and performance varies significantly with speaker attributes such as gender, age, and accent, exposing latent biases and fragility. C3T thus provides a reproducible, interpretable, and attribute-aware evaluation standard for joint speech-language modeling.
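Neither the summary nor the abstract spells out how degradation and speaker invariance are scored, so the sketch below is only an illustrative assumption: a preservation score taken as the ratio of accuracy on spoken prompts to accuracy on the same prompts given as text, broken down by speaker attribute to surface fairness gaps. All function and field names (`preservation_scores`, `correct_text`, `correct_speech`, `speaker`) are hypothetical.

```python
from collections import defaultdict

def preservation_scores(records):
    """records: dicts with boolean 'correct_text' and 'correct_speech' fields
    and a 'speaker' dict of attributes (e.g. gender, age group, accent).
    Returns the overall speech-to-text accuracy ratio plus one ratio per
    attribute value, so gaps between speaker groups become visible."""
    def ratio(rows):
        text_acc = sum(r["correct_text"] for r in rows) / len(rows)
        speech_acc = sum(r["correct_speech"] for r in rows) / len(rows)
        return speech_acc / text_acc if text_acc else float("nan")

    overall = ratio(records)
    by_group = defaultdict(list)
    for r in records:
        for attr, value in r["speaker"].items():
            by_group[(attr, value)].append(r)
    per_group = {key: ratio(rows) for key, rows in by_group.items()}
    return overall, per_group
```

A preservation score near 1.0 would indicate that language understanding carries over from text to speech, while large differences between per-group ratios would correspond to the speaker-attribute biases the summary reports.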

📝 Abstract
The paper presents C3T (Cross-modal Capabilities Conservation Test), a new benchmark for assessing the performance of speech-aware large language models. The benchmark utilizes textual tasks and a voice cloning text-to-speech model to quantify the extent to which language understanding capabilities are preserved when the model is accessed via speech input. C3T quantifies the fairness of the model for different categories of speakers and its robustness across text and speech modalities.
Problem

Research questions and friction points this paper is trying to address.

Assessing language understanding preservation in speech-aware LLMs
Quantifying model fairness across different speaker categories
Evaluating robustness across text and speech modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal benchmark for speech-aware models
Voice cloning TTS to test speech input
Quantifies fairness and robustness across modalities (a rough evaluation sketch follows this list)
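
As a rough end-to-end sketch of how these pieces could fit together, assuming a voice-cloning TTS and a speech-aware LLM exposed through generic callables; `synthesize`, `ask_text`, and `ask_speech` are hypothetical stand-ins, not the paper's actual interfaces:

```python
def evaluate_item(item, speaker_profiles, synthesize, ask_text, ask_speech):
    """Run one benchmark item once in text mode and once per cloned speaker
    profile in speech mode. `synthesize(text, profile)` stands in for a
    voice-cloning TTS call; `ask_text` / `ask_speech` wrap the model."""
    results = {
        "text": ask_text(item["prompt"]) == item["answer"],
        "speech": {},
    }
    for profile in speaker_profiles:  # profiles vary e.g. gender, age, accent
        audio = synthesize(item["prompt"], profile)
        results["speech"][profile["id"]] = ask_speech(audio) == item["answer"]
    return results
```

Aggregating such per-item results with a preservation score like the one sketched above would yield the modality-robustness and per-group fairness numbers the benchmark targets.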