Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the reasoning capabilities of speech-based interactive systems under real-time dialogue constraints, revealing a substantial performance gap between speech and text modalities. To address this, we introduce VERA—the first benchmark explicitly designed for native speech reasoning—comprising 2,931 authentic spoken dialogues spanning five task categories: mathematics, web navigation, scientific reasoning, long-context understanding, and factual recall. VERA enables cross-modal comparison and architectural analysis. We evaluate 12 state-of-the-art systems using joint latency-accuracy assessment, cascade-decoupled modeling, and fine-grained error diagnostics. Results show that the best text-based model achieves 54.0% average accuracy, while speech-based systems attain only 11.3%; the gap reaches 68.7 percentage points on mathematical reasoning. This study provides the first empirical evidence of accuracy stagnation in low-latency speech reasoning and identifies fundamental limitations in current decoupled (ASR → LLM → TTS) architectures.

📝 Abstract
We present Voice Evaluation of Reasoning Ability (VERA), a benchmark for evaluating reasoning ability in voice-interactive systems under real-time conversational constraints. VERA comprises 2,931 voice-native episodes derived from established text benchmarks and organized into five tracks (Math, Web, Science, Long-Context, Factual). Each item is adapted for speech interaction while preserving reasoning difficulty. VERA enables direct text-voice comparison within model families and supports analysis of how architectural choices affect reliability. We assess 12 contemporary voice systems alongside strong text baselines and observe large, consistent modality gaps: on competition mathematics a leading text model attains 74.8% accuracy while its voice counterpart reaches 6.1%; macro-averaged across tracks the best text models achieve 54.0% versus 11.3% for voice. Latency-accuracy analyses reveal a low-latency plateau, where fast voice systems cluster around ~10% accuracy, while approaching text performance requires sacrificing real-time interaction. Diagnostic experiments indicate that common mitigations are insufficient. Increasing "thinking time" yields negligible gains; a decoupled cascade that separates reasoning from narration improves accuracy but still falls well short of text and introduces characteristic grounding/consistency errors. Failure analyses further show distinct error signatures across native streaming, end-to-end, and cascade designs. VERA provides a reproducible testbed and targeted diagnostics for architectures that decouple thinking from speaking, offering a principled way to measure progress toward real-time voice assistants that are both fluent and reliably reasoned.
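The macro-averaged figures quoted above (54.0% text vs. 11.3% voice) are unweighted means over the five tracks. A minimal sketch of that computation, using made-up per-track accuracies rather than the paper's reported numbers:

```python
# Illustrative per-track accuracies for a hypothetical voice system.
# These values are placeholders, not results from the paper.
track_accuracy = {
    "Math": 0.061,
    "Web": 0.15,
    "Science": 0.12,
    "Long-Context": 0.10,
    "Factual": 0.13,
}

def macro_average(scores: dict) -> float:
    """Unweighted mean over tracks, so each track counts equally
    regardless of how many episodes it contains."""
    return sum(scores.values()) / len(scores)

print(round(macro_average(track_accuracy), 3))
```

Macro-averaging keeps a large track (e.g. Factual) from dominating the headline number, which matters when track sizes differ.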
Problem

Research questions and friction points this paper is trying to address.

Evaluating reasoning ability in voice-interactive systems under real-time constraints
Addressing large performance gaps between text and voice modalities in AI systems
Diagnosing why common mitigations fail to bridge text-voice reasoning performance gaps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Voice-native benchmark for real-time reasoning evaluation
Direct text-voice comparison within model architectures
Diagnostic framework for thinking-speaking decoupled systems
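The decoupled cascade the paper diagnoses (ASR → LLM → TTS) can be sketched as three sequential stages with per-stage latency timing. Every stage body below is a hypothetical stand-in, not one of the evaluated systems; only the pipeline shape and the timing instrumentation are the point.

```python
import time

# Placeholder stages -- real systems would call streaming model APIs.
def asr(audio: bytes) -> str:
    return "what is 12 times 9"        # hypothetical transcript

def llm(transcript: str) -> str:
    return "12 times 9 is 108"         # hypothetical reasoning output

def tts(text: str) -> bytes:
    return text.encode("utf-8")        # stand-in for synthesized audio

def cascade(audio: bytes):
    """Run ASR -> LLM -> TTS sequentially and record per-stage latency.

    In a cascade, end-to-end latency is the sum of the stage latencies,
    which is why low-latency operation pressures each stage's budget.
    """
    latencies = {}
    t_start = time.perf_counter()

    t0 = time.perf_counter()
    transcript = asr(audio)
    latencies["asr"] = time.perf_counter() - t0

    t1 = time.perf_counter()
    answer = llm(transcript)
    latencies["llm"] = time.perf_counter() - t1

    t2 = time.perf_counter()
    speech = tts(answer)
    latencies["tts"] = time.perf_counter() - t2

    latencies["total"] = time.perf_counter() - t_start
    return speech, latencies

speech, lat = cascade(b"\x00\x01")
print(sorted(lat))
```

A thinking-speaking decoupled variant would let the LLM stage emit intermediate reasoning that is never narrated, trading added LLM latency for accuracy, which is the trade-off the benchmark's joint latency-accuracy assessment measures.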
Authors

Yueqian Lin (PhD Student, Duke University)
Zhengmian Hu (Adobe Research)
Qinsi Wang (Duke University)
Yudong Liu (Duke University, Durham, NC, USA)
Hengfan Zhang (Duke University, Durham, NC, USA)
Jayakumar Subramanian (Senior Research Scientist, Adobe India)
Nikos Vlassis (Adobe, San Jose, CA, USA)
Hai Helen Li (Duke University, Durham, NC, USA)
Yiran Chen (Duke University, Durham, NC, USA)