MDAR: A Multi-scene Dynamic Audio Reasoning Benchmark

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio reasoning benchmarks predominantly focus on static, single-scene scenarios, failing to assess models' capacity to comprehend multi-speaker interactions, dynamically evolving auditory events, and heterogeneous audio sources. To address this gap, we introduce MDAR, the first benchmark explicitly designed for multi-scene, dynamically evolving audio reasoning. MDAR encompasses diverse audio sources (e.g., speech, environmental sounds) and features five complex reasoning task categories across three question types: single-choice, multiple-choice, and open-ended QA, built from 3,000 carefully curated audio-question pairs. A comprehensive evaluation of 26 state-of-the-art audio-language models reveals clear limitations: the best open-source model, Qwen2.5-Omni, achieves only 76.67% accuracy on single-choice questions; GPT-4o Audio reaches 68.47% there but substantially outperforms Qwen2.5-Omni on the harder multiple-choice and open-ended tasks, and no model reaches 80% on any question type. These results expose critical deficiencies in current models' dynamic auditory understanding capabilities.

📝 Abstract
The ability to reason from audio, including speech, paralinguistic cues, environmental sounds, and music, is essential for AI agents to interact effectively in real-world scenarios. Existing benchmarks mainly focus on static or single-scene settings and do not fully capture scenarios where multiple speakers, unfolding events, and heterogeneous audio sources interact. To address these challenges, we introduce MDAR, a benchmark for evaluating models on complex, multi-scene, and dynamically evolving audio reasoning tasks. MDAR comprises 3,000 carefully curated question-answer pairs linked to diverse audio clips, covering five categories of complex reasoning and spanning three question types. We benchmark 26 state-of-the-art audio language models on MDAR and observe that they exhibit limitations in complex reasoning tasks. On single-choice questions, Qwen2.5-Omni (open-source) achieves 76.67% accuracy, whereas GPT-4o Audio (closed-source) reaches 68.47%; however, GPT-4o Audio substantially outperforms Qwen2.5-Omni on the more challenging multiple-choice and open-ended tasks. Across all three question types, no model achieves 80% performance. These findings underscore the unique challenges posed by MDAR and its value as a benchmark for advancing audio reasoning research. Code and benchmark can be found at https://github.com/luckyerr/MDAR.
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks cover only static, single-scene audio and miss multi-speaker, dynamically evolving interactions
Complex reasoning across heterogeneous audio sources (speech, environmental sound, music) goes unassessed
No standard benchmark tests audio-language models on evolving audio scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

First benchmark for multi-scene, dynamically evolving audio reasoning
3,000 carefully curated question-answer pairs linked to diverse audio clips
Five categories of complex reasoning across three question types (single-choice, multiple-choice, open-ended); a scoring sketch follows below
Evaluation of 26 state-of-the-art audio-language models
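To make the evaluation protocol concrete, here is a minimal sketch of computing per-question-type accuracy on a benchmark of this shape. The record fields (audio, question, type, choices, answer), the JSONL layout, and the predict callback are illustrative assumptions, not MDAR's actual schema; consult the GitHub repository above for the real data format and evaluation code.

```python
import json
from collections import defaultdict

def score(pred, gold, qtype):
    """Return True if the prediction counts as correct for this question type."""
    if qtype == "multiple_choice":        # multi-select: the whole option set must match
        return set(pred) == set(gold)
    return pred == gold                   # single-choice / open-ended: exact-match placeholder

def evaluate(benchmark_path, predict):
    """predict(audio_path, question, choices) -> answer string or list of options.

    Reads one JSON record per line (assumed layout) and returns
    accuracy broken down by question type.
    """
    correct, total = defaultdict(int), defaultdict(int)
    with open(benchmark_path) as f:
        for line in f:
            ex = json.loads(line)
            qtype = ex["type"]            # hypothetical field names throughout
            pred = predict(ex["audio"], ex["question"], ex.get("choices"))
            correct[qtype] += score(pred, ex["answer"], qtype)
            total[qtype] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}
```

Exact string match is only a placeholder for open-ended QA, which the paper presumably judges more leniently (e.g., semantic or LLM-based matching); multi-select questions are scored as correct only when the predicted option set matches the gold set exactly.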
👥 Authors
Hui Li (Fudan University)
Changhao Jiang (Fudan University)
Hongyu Wang (Shanghai Jiao Tong University)
Ming Zhang (Fudan University)
Jiajun Sun (Fudan University)
Zhixiong Yang (Fudan University)
Yifei Cao (Fudan University)
Shihan Dou (Fudan University)
Xiaoran Fan (Fudan University)
Baoyu Fan (IEIT Systems Co Ltd)
Tao Ji (Renmin University of China)
Tao Gui (Fudan University)
Qi Zhang (Fudan University)
Xuanjing Huang (Fudan University)