🤖 AI Summary
Existing large audio-language models (LALMs) suffer substantial performance degradation under complex acoustic conditions—such as noise corruption, multi-source overlap, and temporal misalignment. To address this, we propose Thinking-with-Sound (TwS), the first audio chain-of-thought framework that enables dynamic, on-the-fly invocation of acoustic tools—including noise suppression, source separation, and precise temporal alignment—during language reasoning. TwS establishes cross-modal collaborative computation between linguistic inference and signal-level audio processing, without requiring model retraining. Evaluated on our newly constructed robustness benchmark MELD-Hard1k, TwS achieves absolute accuracy improvements of +24.73% for small models and up to +36.61% for large models, effectively mitigating over 50% of the performance drop observed in challenging acoustic environments.
📝 Abstract
Recent Large Audio-Language Models (LALMs) have shown strong performance on various audio understanding tasks such as speech translation and Audio Q&A. However, they exhibit significant limitations on challenging audio reasoning tasks in complex acoustic scenarios. These situations would greatly benefit from the use of acoustic tools like noise suppression, source separation, and precise temporal alignment, but current LALMs lack access to such tools. To address this limitation, we introduce Thinking-with-Sound (TwS), a framework that equips LALMs with Audio CoT by combining linguistic reasoning with on-the-fly audio-domain analysis. Unlike existing approaches that treat audio as static input, TwS enables models to actively think with audio signals, performing numerical analysis and digital manipulation through multimodal reasoning. To evaluate this approach, we construct MELD-Hard1k, a new robustness benchmark created by introducing various acoustic perturbations. Experiments reveal that state-of-the-art LALMs suffer dramatic performance degradation on MELD-Hard1k, with accuracy dropping by more than $50\%$ compared to clean audio. TwS achieves substantial improvements in robustness, demonstrating both effectiveness and scalability: small models gain $24.73\%$ absolute accuracy, with improvements scaling consistently up to $36.61\%$ for larger models. Our findings demonstrate that Audio CoT can significantly enhance robustness without retraining, opening new directions for developing more robust audio understanding systems.
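The core mechanism described above, letting the model's chain-of-thought trigger signal-level tools mid-reasoning, can be sketched as a simple dispatch loop. This is an illustrative sketch only, not the paper's implementation: the tool names, the `<tool:NAME>` marker syntax, and the toy signal operations are all assumptions made for the example.

```python
# Minimal sketch of an Audio-CoT tool-dispatch loop (assumed design, not
# the paper's actual code): reasoning steps may contain <tool:NAME>
# markers, and each marker applies the named acoustic tool to the audio
# before reasoning continues.
import re

def denoise(audio):
    # Toy stand-in for noise suppression: zero out low-amplitude samples.
    return [x if abs(x) > 0.1 else 0.0 for x in audio]

def separate(audio):
    # Toy stand-in for source separation: keep one "source" (positive part).
    return [max(x, 0.0) for x in audio]

TOOLS = {"denoise": denoise, "separate": separate}

def run_audio_cot(reasoning_steps, audio):
    """Scan each reasoning step for <tool:NAME> markers and apply the tool
    to the working audio signal, returning the processed signal."""
    for step in reasoning_steps:
        for name in re.findall(r"<tool:(\w+)>", step):
            if name in TOOLS:
                audio = TOOLS[name](audio)
    return audio

if __name__ == "__main__":
    steps = ["The clip sounds noisy, so <tool:denoise> first.",
             "Now the speech is clearer; answer the question."]
    print(run_audio_cot(steps, [0.05, -0.3, 0.5, -0.02]))
```

In a real system the tools would be actual DSP modules (e.g., a denoiser or a separation model) and the markers would be emitted by the LALM itself during decoding; the point of the sketch is only the control flow, where signal processing is interleaved with linguistic reasoning rather than applied once up front.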