VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluation of spoken dialogue models overemphasizes the quality of textual responses while neglecting critical vocal dimensions, including paralinguistic cues, acoustic properties, and environmental context, and lacks dedicated benchmarks for these aspects. To address this gap, the authors introduce VocalBench, a comprehensive, speech-native benchmark for spoken dialogue models. It systematically assesses four core dimensions (semantic quality, acoustic performance, conversational abilities, and robustness) across 16 fine-grained skills, and it formally defines and quantifies non-textual vocal elements by combining human-annotated ground truth, multimodal metrics, realistic scenario modeling, and adversarial test instances. Evaluations of state-of-the-art models on 9,400 diverse instances reveal substantial performance disparities across all dimensions. VocalBench thus establishes a reproducible, fine-grained, and speech-centric evaluation standard for rigorous, holistic assessment of spoken dialogue systems.

📝 Abstract
The rapid advancement of large language models (LLMs) has accelerated the development of multi-modal models capable of vocal communication. Unlike text-based interactions, speech conveys rich and diverse information, including semantic content, acoustic variations, paralanguage cues, and environmental context. However, existing evaluations of speech interaction models predominantly focus on the quality of their textual responses, often overlooking critical aspects of vocal performance and lacking benchmarks with vocal-specific test instances. To address this gap, we propose VocalBench, a comprehensive benchmark designed to evaluate speech interaction models' capabilities in vocal communication. VocalBench comprises 9,400 carefully curated instances across four key dimensions: semantic quality, acoustic performance, conversational abilities, and robustness. It covers 16 fundamental skills essential for effective vocal interaction. Experimental results reveal significant variability in current model capabilities, each exhibiting distinct strengths and weaknesses, and provide valuable insights to guide future research in speech-based interaction systems. Code and evaluation instances are available at https://github.com/SJTU-OmniAgent/VocalBench.
Problem

Research questions and friction points this paper is trying to address.

Lack of benchmarks with vocal-specific test instances
Existing evaluations overlook critical aspects of vocal performance
Need for comprehensive evaluation of multi-modal speech interaction models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive benchmark for vocal communication evaluation
Covers 16 essential vocal interaction skills
Includes 9,400 test instances across four dimensions
👥 Authors
Heyang Liu, Shanghai Jiao Tong University (ASR, multimodal understanding)
Yuhao Wang, Shanghai Jiao Tong University
Ziyang Cheng, University of Electronic Science and Technology of China
Ronghua Wu, Ant Group
Qunshan Gu, Ant Group
Yanfeng Wang, Shanghai Jiao Tong University
Yu Wang, Shanghai Jiao Tong University