🤖 AI Summary
Audio-language models still exhibit limited reasoning capabilities, and the field lacks a unified evaluation benchmark for multi-domain audio question answering (AQA).
Method: We introduce the first structured, acoustics-oriented multi-domain AQA benchmark, comprising three subtasks: bioacoustic identification, temporal soundscape understanding, and complex logical reasoning. The framework enables evaluation across acoustic domains and incorporates an answer-shuffling robustness assessment. We conduct systematic evaluations of Qwen2-Audio-7B, AudioFlamingo 2, and Gemini-2-Flash.
Contribution/Results: The benchmark spans diverse acoustic content, from marine mammal vocalizations to real-world urban soundscapes, built from heterogeneous multi-source data and evaluated under a top-1 accuracy protocol. Development-set results reveal significant inter-domain performance disparities, establishing a fine-grained, reproducible evaluation standard for acoustic reasoning in audio-language models.
📝 Abstract
We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. This task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes and complex real-world clips), the evaluation protocol (top-1 accuracy with answer-shuffling robustness), and baseline systems (Qwen2-Audio-7B, AudioFlamingo 2, Gemini-2-Flash). Preliminary results on the development set show strong variation across models and subsets. This challenge aims to advance the audio understanding and reasoning capabilities of audio-language models toward human-level acuity; such capabilities are crucial for enabling AI agents to perceive and interact with the world effectively.
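Neither the summary nor the abstract spells out how the answer-shuffling robustness check works, but the sketch below shows one plausible reading: each question's multiple-choice options are re-shuffled several times and top-1 accuracy is averaged over the permutations, so a model cannot benefit from positional biases (e.g. always picking option "A"). The `model.answer` interface, the field names, and `n_shuffles` are illustrative assumptions, not the challenge's actual API.

```python
import random

def evaluate_top1_with_shuffling(model, questions, n_shuffles=3, seed=0):
    """Hypothetical top-1 accuracy with answer-shuffling robustness.

    Assumed interfaces (not from the challenge spec):
    - model.answer(audio, question_text, choices) returns the index of the
      option the model selects;
    - each item in `questions` carries "audio", "text", "choices", and
      "answer" (index of the gold option in the original choice order).
    """
    rng = random.Random(seed)
    correct, total = 0, 0
    for q in questions:
        for _ in range(n_shuffles):
            # Permute the answer options and track where the gold answer lands.
            order = list(range(len(q["choices"])))
            rng.shuffle(order)
            shuffled = [q["choices"][i] for i in order]
            gold = order.index(q["answer"])  # gold option's new position
            pred = model.answer(q["audio"], q["text"], shuffled)
            correct += int(pred == gold)
            total += 1
    # Top-1 accuracy averaged over all shuffled presentations.
    return correct / total
```

Averaging over permutations rewards models whose answers depend on the audio and question content rather than on option position, which is the robustness property the protocol is meant to probe.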