🤖 AI Summary
Existing end-to-end speech-based dialogue evaluation benchmarks are limited to synthetic speech and single-turn tasks, failing to capture genuine multi-turn interactive capabilities. This paper introduces Audio MultiChallenge—the first end-to-end benchmark for natural, multi-turn spoken dialogue evaluation. Our method establishes a speech-native, multi-axis assessment framework covering four dimensions: reasoning memory, instruction retention, self-consistency, and speech editing. It is the first to incorporate realistic interaction challenges—including environmental sounds, paralinguistic cues, and mid-dialogue speech repair. Evaluation employs a hybrid pipeline of audio-native agents and human annotators, preserving disfluencies inherent in spontaneous speech and enabling fine-grained failure-mode analysis. Experiments reveal that state-of-the-art models (e.g., Gemini 3 Pro Preview) achieve only a 54.65% overall pass rate, exposing systematic bottlenecks in speech editing, audio cue tracking, and long-context self-consistency.
📝 Abstract
End-to-end (E2E) spoken dialogue systems are increasingly replacing cascaded pipelines for voice-based human-AI interaction, processing raw audio directly without intermediate transcription. Existing benchmarks primarily evaluate these models on synthetic speech and single-turn tasks, leaving realistic multi-turn conversational ability underexplored. We introduce Audio MultiChallenge, an open-source benchmark to evaluate E2E spoken dialogue systems under natural multi-turn interaction patterns. Building on the text-based MultiChallenge framework, which evaluates Inference Memory, Instruction Retention, and Self Coherence, we introduce a new axis Voice Editing that tests robustness to mid-utterance speech repairs and backtracking. We further augment each axis to the audio modality, such as introducing Audio-Cue challenges for Inference Memory that require recalling ambient sounds and paralinguistic signals beyond semantic content. We curate 452 conversations from 47 speakers with 1,712 instance-specific rubrics through a hybrid audio-native agentic and human-in-the-loop pipeline that exposes model failures at scale while preserving natural disfluencies found in unscripted human speech. Our evaluation of proprietary and open-source models reveals that even frontier models struggle on our benchmark, with Gemini 3 Pro Preview (Thinking), our highest-performing model achieving a 54.65% pass rate. Error analysis shows that models fail most often on our new axes and that Self Coherence degrades with longer audio context. These failures reflect difficulty of tracking edits, audio cues, and long-range context in natural spoken dialogue. Audio MultiChallenge provides a reproducible testbed to quantify them and drive improvements in audio-native multi-turn interaction capability.