🤖 AI Summary
Existing evaluations of voice assistants focus predominantly on semantic understanding, neglecting systematic, quantitative assessment of the acoustic quality of generated speech. To address this gap, we propose SOVA-Bench, the first end-to-end benchmark for voice assistants, which unifies acoustic quality (e.g., Mel Cepstral Distortion, SIM), general knowledge, speech understanding, and semantic generation within a multi-level, cross-modal evaluation framework. Our reproducible evaluation pipeline integrates ASR error analysis, semantic consistency scoring, subjective Mean Opinion Score (MOS), and objective acoustic metrics. We comprehensively evaluate 12 state-of-the-art voice-oriented large language models (LLMs). Results reveal critical bottlenecks in speech naturalness and response coherence, highlighting the need for improved prosody modeling and contextual continuity. SOVA-Bench establishes a standardized, empirically grounded evaluation tool to advance human-like spoken interaction systems.
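The summary above names Mel Cepstral Distortion (MCD) among the objective acoustic metrics. As a rough illustration of what such a metric measures, here is a minimal sketch of a frame-averaged MCD computation between a reference recording and a generated response, assuming `librosa` and `numpy` are available; the function name, file paths, and MFCC configuration are illustrative assumptions, not the paper's official pipeline.

```python
# Minimal sketch of an objective acoustic metric of the kind SOVA-Bench aggregates.
# Assumes librosa and numpy; paths and n_mfcc are placeholders, not the paper's setup.
import numpy as np
import librosa


def mel_cepstral_distortion(ref_wav: str, syn_wav: str, n_mfcc: int = 13) -> float:
    """Frame-averaged MCD (dB) between a reference and a synthesized utterance."""
    ref, sr = librosa.load(ref_wav, sr=None)
    syn, _ = librosa.load(syn_wav, sr=sr)

    # Mel-cepstral features; drop the 0th (energy) coefficient, as is conventional.
    ref_mfcc = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=n_mfcc)[1:]
    syn_mfcc = librosa.feature.mfcc(y=syn, sr=sr, n_mfcc=n_mfcc)[1:]

    # Align frames with DTW so utterances of different lengths are comparable.
    _, path = librosa.sequence.dtw(X=ref_mfcc, Y=syn_mfcc, metric="euclidean")
    ref_aligned = ref_mfcc[:, path[:, 0]]
    syn_aligned = syn_mfcc[:, path[:, 1]]

    # MCD per frame = (10 / ln 10) * sqrt(2 * sum_d (c_d^ref - c_d^syn)^2), then average.
    diff = ref_aligned - syn_aligned
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=0))
    return float(np.mean(per_frame))


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(mel_cepstral_distortion("reference.wav", "assistant_response.wav"))
```

Lower MCD indicates a closer spectral match to the reference; SIM and MOS capture the complementary speaker-similarity and perceptual dimensions mentioned above.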
📝 Abstract
Thanks to steady progress in large language models (LLMs), speech encoding algorithms, and vocoder architectures, recent systems can generate speech responses directly from user instructions. However, benchmarking the quality of the generated speech has been a neglected yet critical issue, given the shift from pursuing semantic accuracy alone toward vivid and spontaneous speech. Previous evaluations have focused on speech-understanding ability and lack a quantitative measure of acoustic quality. In this paper, we propose the Speech cOnversational Voice Assistant Benchmark (SOVA-Bench), which provides a comprehensive comparison of general knowledge, speech recognition and understanding, and both semantic and acoustic generative ability across available speech LLMs. To the best of our knowledge, SOVA-Bench is one of the most systematic evaluation frameworks for speech LLMs, and we hope it will inspire the future direction of voice interaction systems.