🤖 AI Summary
Existing evaluations of voice assistants focus predominantly on semantic understanding, neglecting systematic, quantitative assessment of the acoustic quality of generated speech. To address this gap, we propose SOVA-Bench, the first end-to-end benchmark for voice assistants, which unifies acoustic quality (e.g., Mel Cepstral Distortion, SIM), general knowledge, speech understanding, and semantic generation within a multi-level, cross-modal evaluation framework. Our reproducible evaluation pipeline integrates ASR error analysis, semantic consistency scoring, subjective Mean Opinion Score (MOS), and objective acoustic metrics. We comprehensively evaluate 12 state-of-the-art voice-oriented large language models (LLMs). Results reveal critical bottlenecks in speech naturalness and response coherence, highlighting the need for improved prosody modeling and contextual continuity. SOVA-Bench establishes a standardized, empirically grounded evaluation tool to advance human-like spoken interaction systems.
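The summary above names Mel Cepstral Distortion (MCD) among the objective acoustic metrics. As a rough illustration of what such a metric measures, here is a minimal sketch of a frame-averaged MCD computation between a reference recording and a generated response, assuming `librosa` and `numpy` are available; the function name, file paths, and MFCC configuration are illustrative assumptions, not the paper's official pipeline.

```python
# Minimal sketch of an objective acoustic metric of the kind SOVA-Bench aggregates.
# Assumes librosa and numpy; paths and n_mfcc are placeholders, not the paper's setup.
import numpy as np
import librosa


def mel_cepstral_distortion(ref_wav: str, syn_wav: str, n_mfcc: int = 13) -> float:
    """Frame-averaged MCD (dB) between a reference and a synthesized utterance."""
    ref, sr = librosa.load(ref_wav, sr=None)
    syn, _ = librosa.load(syn_wav, sr=sr)

    # Mel-cepstral features; drop the 0th (energy) coefficient, as is conventional.
    ref_mfcc = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=n_mfcc)[1:]
    syn_mfcc = librosa.feature.mfcc(y=syn, sr=sr, n_mfcc=n_mfcc)[1:]

    # Align frames with DTW so utterances of different lengths are comparable.
    _, path = librosa.sequence.dtw(X=ref_mfcc, Y=syn_mfcc, metric="euclidean")
    ref_aligned = ref_mfcc[:, path[:, 0]]
    syn_aligned = syn_mfcc[:, path[:, 1]]

    # MCD per frame = (10 / ln 10) * sqrt(2 * sum_d (c_d^ref - c_d^syn)^2), then average.
    diff = ref_aligned - syn_aligned
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=0))
    return float(np.mean(per_frame))


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(mel_cepstral_distortion("reference.wav", "assistant_response.wav"))
```

Lower MCD indicates a closer spectral match to the reference; SIM and MOS capture the complementary speaker-similarity and perceptual dimensions mentioned above.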
📝 Abstract
Thanks to steady progress in large language models (LLMs), speech encoding algorithms, and vocoder architectures, recent systems can generate speech responses directly from user instructions. However, benchmarking the quality of the generated speech has been a neglected yet critical issue, given the shift from pursuing semantic accuracy alone toward vivid and spontaneous speech. Previous evaluations have focused on speech-understanding ability and lack a quantitative measure of acoustic quality. In this paper, we propose the Speech cOnversational Voice Assistant Benchmark (SOVA-Bench), which provides a comprehensive comparison of general knowledge, speech recognition and understanding, and both semantic and acoustic generative ability across available speech LLMs. To the best of our knowledge, SOVA-Bench is one of the most systematic evaluation frameworks for speech LLMs, and we hope it will inspire the future direction of voice interaction systems.