🤖 AI Summary
This work addresses the sharp performance drop of speech large language models (SLLMs) on logical reasoning tasks that require entity tracking. The drop stems primarily from continuous speech representations losing the associations between entities and their attributes. The study frames this gap as an entity-binding bottleneck that can be intervened on, and proposes Entity-Aware Chain-of-Thought (EA-CoT), which explicitly guides the model to enumerate entities and bind them to their attributes before reasoning. The approach remains effective even when spoken names are misrecognized. Experiments show that EA-CoT improves accuracy by up to 24.4 percentage points (absolute) on spoken logical reasoning tasks, substantially narrowing the gap between speech-input and text-input performance.
📝 Abstract
Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this modality gap is not a uniform cognitive deficit. Evaluating three diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. However, on logical tasks requiring entity tracking, S2T accuracy collapses to chance. We diagnose this localized degradation as an entity binding failure: continuous speech features cause models to lose precise entity-property associations during implicit reasoning. To resolve this, we propose Entity-Aware Chain-of-Thought (EA-CoT), forcing SLLMs to explicitly enumerate entities and bind them to claims before reasoning. Strikingly, EA-CoT bridges the gap, even when spoken names are misrecognized, yielding up to a 24.4% absolute accuracy improvement. Ablations confirm these gains stem entirely from explicit semantic binding, reframing the gap as a resolvable bottleneck.
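To make the enumerate-then-bind idea concrete, here is a minimal sketch of how such a prompt might be assembled. The step wording, the function names `build_ea_cot_prompt` and `build_direct_prompt`, and the assumption that the text instruction accompanies a spoken question are all illustrative. The abstract does not give the actual EA-CoT template.

```python
# Hypothetical step wording: an illustration of "enumerate entities, bind them
# to claims, then reason", not the template used in the paper.
EA_COT_STEPS = (
    "List every entity mentioned in the question.",
    "For each entity, write down the claims or properties attached to it.",
    "Reason step by step using only the bindings listed above.",
    "State the final answer.",
)


def build_ea_cot_prompt(task_instruction: str) -> str:
    """Wrap a task instruction with explicit enumerate-then-bind steps."""
    numbered = "\n".join(
        f"{i}. {step}" for i, step in enumerate(EA_COT_STEPS, start=1)
    )
    return (
        f"{task_instruction.strip()}\n\n"
        f"Before answering, follow these steps:\n{numbered}"
    )


def build_direct_prompt(task_instruction: str) -> str:
    """Baseline without explicit binding: ask for the answer directly."""
    return f"{task_instruction.strip()}\n\nAnswer directly."


if __name__ == "__main__":
    # The instruction is assumed to be paired with an audio question.
    print(build_ea_cot_prompt("Listen to the audio and answer the question."))
```

Because the binding step asks the model to write out whatever entity names it perceived, its reasoning can stay internally consistent even if a name is transcribed incorrectly, which is in line with the abstract's observation that the gains survive misrecognized names.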