Benchmarking Contextual and Paralinguistic Reasoning in Speech-LLMs: A Case Study with In-the-Wild Data

📅 2025-09-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current speech large language models (speech-LLMs) exhibit significant deficiencies in paralinguistic understanding, such as emotion, prosody, and other nonverbal cues, hindering their social and affective intelligence. To address this gap, we introduce CP-Bench, the first systematic benchmark for context-aware paralinguistic reasoning, featuring realistic tasks that jointly model linguistic content and nonverbal signals. We construct two novel question-answering datasets requiring integrated linguistic and emotional comprehension, enabling comprehensive evaluation of leading open- and closed-source speech-LLMs, including ablation studies on the effect of the temperature parameter. Experimental results reveal pervasive weaknesses in empathic reasoning across all models, with even state-of-the-art systems exhibiting critical limitations. This work provides the first quantitative characterization of the paralinguistic reasoning capabilities, and fundamental boundaries, of speech-LLMs, establishing an empirical foundation and concrete improvement pathways for modeling and optimizing affectively intelligent dialogue systems.

📝 Abstract
Recent speech-LLMs have shown impressive performance in tasks like transcription and translation, yet they remain limited in understanding the paralinguistic aspects of speech crucial for social and emotional intelligence. We propose CP-Bench, a benchmark for evaluating speech-LLMs on contextual paralinguistic reasoning: the integration of verbal content with non-verbal cues like emotion and prosody. The benchmark includes two curated question-answering (QA) datasets requiring both linguistic and empathetic understanding. We evaluate state-of-the-art speech-LLMs from both open- and closed-source families and perform a comprehensive analysis across different question types. The top two models were further analyzed under temperature tuning to understand its effect on this task. Our benchmark reveals a key gap in existing evaluations and offers insights into building more context-aware and emotionally intelligent speech-capable LLMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluating speech-LLMs' understanding of paralinguistic cues like emotion
Assessing integration of verbal content with non-verbal contextual information
Identifying limitations in current speech-LLM social and emotional intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposed CP-Bench benchmark for contextual paralinguistic reasoning
Evaluated speech-LLMs using curated question answering datasets
Analyzed model performance under temperature tuning variations
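The temperature-tuning analysis above can be pictured as a simple sweep over sampling temperatures, scoring QA accuracy at each setting. The sketch below is hypothetical and not CP-Bench's actual harness: `answer_question` is a stub standing in for a real speech-LLM API call, and the accuracy-vs-temperature relationship is simulated purely for illustration.

```python
# Hypothetical temperature-sweep evaluation loop for a speech QA benchmark.
# `answer_question` is a stub for a real speech-LLM call; it simulates
# accuracy dropping as sampling temperature rises (an assumption, not a result).
import random


def answer_question(audio_id, question, temperature, rng):
    # A real harness would send the audio + question to a speech-LLM
    # sampled at `temperature` and return its answer string.
    p_correct = max(0.0, 0.8 - 0.3 * temperature)
    return "correct" if rng.random() < p_correct else "wrong"


def evaluate(dataset, temperature, seed=0):
    """Return QA accuracy of the (stub) model at a given temperature."""
    rng = random.Random(seed)  # fixed seed so sweeps are comparable
    hits = sum(
        answer_question(ex["audio"], ex["question"], temperature, rng) == ex["answer"]
        for ex in dataset
    )
    return hits / len(dataset)


# Toy dataset: 200 utterances, each with a gold answer.
dataset = [
    {"audio": f"utt_{i}", "question": "What emotion is conveyed?", "answer": "correct"}
    for i in range(200)
]

for t in (0.0, 0.5, 1.0):
    print(f"temperature={t:.1f}  accuracy={evaluate(dataset, t):.2f}")
```

Because the random seed is fixed per run, the sweep is reproducible; a real harness would instead average over multiple sampled generations per temperature.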
Qiongqiong Wang
Lead Research Engineer, Institute for Infocomm Research (I2R), A*STAR, Singapore
Deep Learning · Artificial Intelligence · Machine Learning
Hardik Bhupendra Sailor
Institute for Infocomm Research (I2R), A*STAR, Singapore
Tianchi Liu
Tencent, Singapore; Ph.D. @ National University of Singapore; Ex-A*STAR, Singapore
Text-to-Speech · Speech-LLM · Speaker Verification · Anti-spoofing · Deepfake Detection
Wenyu Zhang
Institute for Infocomm Research (I2R), A*STAR, Singapore
Muhammad Huzaifah
Institute for Infocomm Research (I2R), A*STAR, Singapore
Nattadaporn Lertcheva
Institute for Infocomm Research (I2R), A*STAR, Singapore
Shuo Sun
Johns Hopkins University
Nancy F. Chen
ISCA Fellow, AAIA Fellow, Multimodal Generative AI Group Leader, AI for Education Head at A*STAR
Agentic AI · Large Language Models · Conversational AI
Jinyang Wu
Institute for Infocomm Research (I2R), A*STAR, Singapore
AiTi Aw