🤖 AI Summary
Existing audio-language models remain weak at deep, multi-step reasoning. To address this, the authors propose Audio-Reasoner, a large-scale, multi-task audio-language model built for reasoning-intensive tasks, together with CoTA, a structured audio chain-of-thought (CoT) dataset of 1.2 million samples. The dataset is built with the help of closed-source models, which perform secondary annotation and question-answer generation on top of simpler initial labels. The data construction is multi-stage (human curation → model-assisted re-annotation → structured CoT injection), and the model is then trained on the resulting reasoning-rich data. Audio-Reasoner reports state-of-the-art results across several benchmarks: +25.42% on MMAU-mini, +14.57% and +10.13% on AIR-Bench (chat and foundation tracks, respectively), and +8.01% on MELD, indicating substantial gains in audio reasoning capability.
📝 Abstract
Recent advancements in multimodal reasoning have largely overlooked the audio modality. We introduce Audio-Reasoner, a large-scale audio language model for deep reasoning in audio tasks. We first curate a large-scale, diverse, multi-task audio dataset with simple annotations. We then leverage closed-source models to perform secondary labeling and QA generation, and to produce a structured CoT process for each sample. Together, these data form a high-quality reasoning dataset of 1.2 million reasoning-rich samples, which we name CoTA. Following inference scaling principles, we train Audio-Reasoner on CoTA, enabling it to achieve strong logical capabilities in audio reasoning. Experiments show state-of-the-art performance across key benchmarks, including MMAU-mini (+25.42%), AIR-Bench chat/foundation (+14.57%/+10.13%), and MELD (+8.01%). Our findings underscore the central role of structured CoT training in advancing audio reasoning.