VoiceAgentBench: Are Voice Assistants ready for agentic tasks?

📅 2025-10-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing speech assistant benchmarks primarily evaluate isolated capabilities (e.g., ASR or QA) and lack systematic assessment of multilingualism, multi-turn agent interaction, tool invocation, and safety robustness, especially in culturally diverse contexts such as India. This work introduces the first comprehensive agent-oriented benchmark for Speech Language Models (SpeechLMs), featuring multilingual dialogue evaluation, speaker-aware diverse query sampling, adversarial robustness testing, tool-call consistency verification, and safety validation. The authors propose a speaker-embedding-based voice query sampling algorithm and construct a high-quality evaluation set of over 5,500 synthetic speech queries. Experimental results reveal significant bottlenecks in current SpeechLMs with respect to cross-lingual generalization, multi-tool coordination, and defense against adversarial attacks.

📝 Abstract
Large-scale Speech Language Models (SpeechLMs) have enabled voice assistants capable of understanding natural spoken queries and performing complex tasks. However, existing speech benchmarks primarily focus on isolated capabilities such as transcription or question-answering, and do not systematically evaluate agentic scenarios encompassing multilingual and cultural understanding, as well as adversarial robustness. To address this, we introduce VoiceAgentBench, a comprehensive benchmark designed to evaluate SpeechLMs in realistic spoken agentic settings. It comprises over 5,500 synthetic spoken queries, including dialogues grounded in the Indian context, covering single-tool invocations, multi-tool workflows, multi-turn interactions, and safety evaluations. The benchmark supports English, Hindi, and 5 other Indian languages, reflecting real-world linguistic and cultural diversity. We simulate speaker variability using a novel sampling algorithm that selects audios for TTS voice conversion based on their speaker embeddings, maximizing acoustic and speaker diversity. Our evaluation measures tool selection accuracy, structural consistency, and the correctness of tool invocations, including adversarial robustness. Our experiments reveal significant gaps in contextual tool orchestration tasks, Indic generalization, and adversarial robustness, exposing critical limitations of current SpeechLMs.
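The abstract describes a sampling algorithm that picks audios for TTS voice conversion so as to maximize acoustic and speaker diversity in embedding space. The paper does not spell out the algorithm here, but a common way to achieve this is greedy farthest-point sampling over speaker embeddings; the sketch below is a hypothetical illustration of that idea (function name, cosine-distance choice, and random seed point are assumptions, not the paper's actual method):

```python
import numpy as np

def diverse_sample(embeddings, k, seed=0):
    """Greedy farthest-point sampling over speaker embeddings.

    Each new pick maximizes its minimum cosine distance to the picks
    so far -- a standard diversity heuristic, used here only to
    illustrate the kind of selection the benchmark describes.
    """
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize rows
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]             # random seed point
    # min cosine distance from every candidate to the selected set
    min_dist = 1.0 - X @ X[selected[0]]
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))                 # farthest remaining point
        selected.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - X @ X[nxt])
    return selected
```

With two tight clusters of speaker embeddings, sampling two items would pick one from each cluster, which matches the stated goal of spreading selections across acoustically distinct speakers.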
Problem

Research questions and friction points this paper is trying to address.

Evaluating SpeechLMs in realistic multilingual agentic scenarios
Assessing tool orchestration and cultural understanding gaps
Testing adversarial robustness and Indic language generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

VoiceAgentBench evaluates SpeechLMs in agentic scenarios
Novel sampling algorithm maximizes acoustic and speaker diversity
Benchmark covers multilingual tool orchestration and adversarial robustness