AudSemThinker: Enhancing Audio-Language Models through Reasoning over Semantics of Sound

📅 2025-05-20

🤖 AI Summary

Existing audio-language models excel at coarse-grained sound classification but struggle with fine-grained auditory semantic reasoning, such as inferring causal, temporal, or attribute-based relationships. To address this, we propose AudSemThinker, a model whose reasoning follows a cognition-inspired framework of auditory semantics. We also introduce AudSem, a purpose-built audio-text dataset designed for semantic descriptor reasoning. The dataset pairs a carefully filtered collection of audio samples with captions generated through a robust multi-stage pipeline, which limits data contamination in zero-shot evaluation. Experiments show that the model outperforms state-of-the-art methods across multiple training settings, with gains on fine-grained tasks including sound attribute reasoning and event causal inference. Both the model and the AudSem dataset are publicly released to support reproducible research in auditory semantic understanding.

📝 Abstract

Audio-language models have shown promising results in various sound understanding tasks, yet they remain limited in their ability to reason over the fine-grained semantics of sound. In this paper, we present AudSemThinker, a model whose reasoning is structured around a framework of auditory semantics inspired by human cognition. To support this, we introduce AudSem, a novel dataset specifically curated for semantic descriptor reasoning in audio-language models. AudSem addresses the persistent challenge of data contamination in zero-shot evaluations by providing a carefully filtered collection of audio samples paired with captions generated through a robust multi-stage pipeline. Our experiments demonstrate that AudSemThinker outperforms state-of-the-art models across multiple training settings, highlighting its strength in semantic audio reasoning. Both AudSemThinker and the AudSem dataset are released publicly.

Problem

Research questions and friction points this paper is trying to address.

Enhancing audio-language models' fine-grained semantic reasoning

Addressing data contamination in zero-shot audio evaluations

Improving semantic audio reasoning with a novel dataset

Innovation

Methods, ideas, or system contributions that make the work stand out.

AudSemThinker structures its reasoning around a cognition-inspired framework of auditory semantics

Introduces AudSem dataset for semantic descriptor reasoning

Filtered audio samples with multi-stage caption generation reduce data contamination in zero-shot evaluation
