SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant

📅 2025-06-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluations of voice assistants focus predominantly on semantic understanding, neglecting systematic, quantitative assessment of generated speech acoustic quality. To address this gap, we propose SOVA-Bench—the first end-to-end benchmark for voice assistants—that unifies acoustic quality (e.g., Mel Cepstral Distortion, SIM), general knowledge, speech understanding, and semantic generation within a multi-level, cross-modal evaluation framework. Our reproducible evaluation pipeline integrates ASR error analysis, semantic consistency scoring, subjective Mean Opinion Score (MOS), and objective acoustic metrics. We comprehensively evaluate 12 state-of-the-art voice-oriented large language models (LLMs). Results reveal critical bottlenecks in speech naturalness and response coherence, highlighting the need for improved prosody modeling and contextual continuity. SOVA-Bench establishes a standardized, empirically grounded evaluation tool to advance human-like spoken interaction systems.
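The summary lists Mel Cepstral Distortion (MCD) among the objective acoustic metrics. As a point of reference, here is a minimal sketch of the standard frame-averaged MCD formula; this is a generic illustration, not code from the paper, and it assumes the reference and synthesized mel-cepstral sequences have already been time-aligned (e.g. via DTW):

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_syn):
    """Frame-averaged MCD in dB between two aligned mel-cepstral
    sequences of shape (frames, coeffs). Coefficient 0 (energy)
    is conventionally excluded from the distance."""
    diff = mcep_ref[:, 1:] - mcep_syn[:, 1:]
    # Per-frame MCD: (10 / ln 10) * sqrt(2 * sum_k (c_ref_k - c_syn_k)^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```

Lower values indicate that the synthesized cepstra track the reference more closely; identical inputs give exactly 0 dB.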

📝 Abstract
Thanks to steady progress in large language models (LLMs), speech encoding algorithms, and vocoder architectures, recent systems can generate speech responses directly from a user instruction. However, benchmarking the quality of the generated speech has been a neglected but critical issue, given the shift from pursuing semantic accuracy alone toward vivid, spontaneous speech flow. Previous evaluations focused on speech-understanding ability and lacked any quantification of acoustic quality. In this paper, we propose the Speech cOnversational Voice Assistant Benchmark (SOVA-Bench), which provides a comprehensive comparison of general knowledge, speech recognition and understanding, and both semantic and acoustic generative ability across available speech LLMs. To the best of our knowledge, SOVA-Bench is one of the most systematic evaluation frameworks for speech LLMs, and we hope it helps guide the direction of voice interaction systems.
Problem

Research questions and friction points this paper is trying to address.

Existing evaluations of voice assistants emphasize semantic understanding and neglect the acoustic quality of generated speech
Semantic and acoustic generative abilities need joint, quantitative assessment
Speech LLMs lack a systematic, standardized evaluation framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

SOVA-Bench: an end-to-end benchmark spanning general knowledge, speech recognition and understanding, and response generation
Unified evaluation of both semantic and acoustic generative ability
A systematic, reproducible framework for comparing available speech LLMs
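The evaluation pipeline described in the AI summary includes ASR error analysis; the usual headline number there is word error rate (WER). A minimal edit-distance sketch, shown as a generic illustration rather than the paper's implementation:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, one substituted word in a three-word reference yields a WER of 1/3; note that WER can exceed 1.0 when the hypothesis contains many insertions.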
👥 Authors
Yixuan Hou, School of Artificial Intelligence, Shanghai Jiao Tong University, China
Heyang Liu, Shanghai Jiao Tong University (ASR, multimodal understanding)
Yuhao Wang, School of Artificial Intelligence, Shanghai Jiao Tong University, China; Ant Group, China
Ziyang Cheng, University of Electronic Science and Technology of China
Ronghua Wu, Ant Group, China
Qunshan Gu, Ant Group, China
Yanfeng Wang, Shanghai Jiao Tong University
Yu Wang, School of Artificial Intelligence, Shanghai Jiao Tong University, China