🤖 AI Summary
This work investigates whether large language models' linguistic understanding degrades when they are accessed through speech input. To address this, the authors introduce C3T, the first dedicated benchmark to combine text-understanding tasks, controllable voice cloning, and a cross-modal evaluation framework, systematically quantifying performance degradation and speaker invariance under speech input. C3T decouples phonetic attributes (e.g., timbre, speaking rate, accent) from semantic content, enabling fine-grained assessment of cross-modal robustness and fairness. Experiments reveal substantial comprehension deterioration across mainstream speech-language models, with performance significantly influenced by speaker attributes such as gender, age, and accent, exposing latent biases and fragility. C3T thus provides a reproducible, interpretable, and attribute-aware evaluation standard for joint speech-language modeling.
📝 Abstract
The paper presents C3T (Cross-modal Capabilities Conservation Test), a new benchmark for assessing the performance of speech-aware large language models. The benchmark pairs textual tasks with a voice-cloning text-to-speech model to quantify the extent to which language-understanding capabilities are preserved when the model is accessed via speech input. C3T also measures the model's fairness across different categories of speakers and its robustness across text and speech modalities.
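The core idea of the evaluation (comparing a model's accuracy on the same tasks in text versus cloned-speech form, broken down by speaker attribute) can be sketched as follows. This is a minimal illustration, not the paper's actual scoring code: the record format, attribute values, and the `conservation_by_group` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-item results: (speaker_attribute, correct_on_text, correct_on_speech).
# The attribute values and outcomes below are illustrative, not from the paper.
results = [
    ("female", True, True),
    ("female", True, False),
    ("male",   True, True),
    ("male",   False, False),
    ("male",   True, True),
]

def accuracy(flags):
    return sum(flags) / len(flags)

def conservation_by_group(records):
    """Ratio of speech accuracy to text accuracy per speaker group.

    1.0 means understanding is fully conserved under speech input;
    lower values indicate cross-modal degradation for that group.
    """
    groups = defaultdict(lambda: ([], []))
    for attr, text_ok, speech_ok in records:
        groups[attr][0].append(text_ok)
        groups[attr][1].append(speech_ok)
    return {
        attr: accuracy(speech) / accuracy(text)
        for attr, (text, speech) in groups.items()
        if accuracy(text) > 0  # skip groups the model fails even in text
    }

scores = conservation_by_group(results)
# A large gap in conservation between groups would signal the kind of
# speaker-attribute bias the benchmark is designed to expose.
```

Comparing these per-group conservation ratios is one simple way to express both the robustness claim (how far each ratio falls below 1.0) and the fairness claim (how much the ratios differ across speaker categories).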