🤖 AI Summary
This work addresses the challenge that existing audio–language models struggle to interpret the position, motion, and directional changes of sound sources from dynamic binaural audio, and to reason about them. To overcome this limitation, the authors propose a movement-centric spatial audio augmentation framework that synthesizes diverse motion patterns from isolated mono audio events, together with an end-to-end fine-tuning approach featuring a chain-of-thought–inspired thinking mode that produces explicit intermediate reasoning steps before the final answer. Key components include the motion-guided data augmentation strategy, query-conditioned sound source separation as a preprocessing stage, and a comparison of three inference regimes: no masking, an audio grounding model (AGM), and ground-truth masks. Experiments show that explicit reasoning amplifies the benefits of source separation, with thinking mode improving accuracy by 5.1% on questions involving a single sound event.
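To make the augmentation step concrete, below is a minimal sketch of how a moving binaural source can be rendered from a mono event using time-varying interaural time and level differences (ITD/ILD). The sample rate, spherical-head model, linear azimuth trajectory, and function names are assumptions for illustration, not the authors' actual renderer.

```python
import numpy as np

SR = 16_000           # sample rate in Hz (assumed)
HEAD_RADIUS = 0.0875  # approximate human head radius, meters
C = 343.0             # speed of sound, m/s

def synthesize_moving_source(mono, azi_start_deg, azi_end_deg, sr=SR):
    """Render a mono event as stereo with a linearly moving azimuth
    (-90 deg = hard left, +90 deg = hard right)."""
    n = len(mono)
    azimuth = np.deg2rad(np.linspace(azi_start_deg, azi_end_deg, n))

    # Interaural time difference from Woodworth's spherical-head model:
    # ITD = r/c * (sin(theta) + theta); positive when the source is right.
    itd_samples = (HEAD_RADIUS / C) * (np.sin(azimuth) + azimuth) * sr

    # Per-sample fractional delay via linear interpolation: the far ear
    # (left, for positive azimuth) hears the signal slightly later.
    t = np.arange(n, dtype=np.float64)
    left = np.interp(t - itd_samples / 2.0, t, mono, left=0.0, right=0.0)
    right = np.interp(t + itd_samples / 2.0, t, mono, left=0.0, right=0.0)

    # Interaural level difference via an equal-power sine panning law,
    # a crude stand-in for a frequency-dependent HRTF.
    right_share = 0.5 * (1.0 + np.sin(azimuth))
    return np.stack([left * np.sqrt(1.0 - right_share),
                     right * np.sqrt(right_share)])

# Example: a 2 s noise burst sweeping left-to-right yields one synthetic
# "moving source" clip together with an automatic motion annotation.
event = 0.1 * np.random.randn(2 * SR)
stereo = synthesize_moving_source(event, azi_start_deg=-80, azi_end_deg=80)
label = {"start_azimuth": -80, "end_azimuth": 80, "direction": "left_to_right"}
```

Because the trajectory is specified programmatically, every synthesized clip carries exact motion labels for free, which is what makes this kind of training data generation controlled and scalable.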
📝 Abstract
Spatial audio understanding aims to enable machines to interpret complex auditory scenes, particularly when sound sources move over time. In this work, we study Spatial Audio Question Answering (Spatial AQA) with a focus on movement reasoning, where a model must infer object motion, position, and directional changes directly from stereo audio. First, we introduce a movement-centric spatial audio augmentation framework that synthesizes diverse motion patterns from isolated mono audio events, enabling controlled and scalable training data generation. Second, we propose an end-to-end multimodal fine-tuning approach with a thinking mode, which allows audio-language models to produce explicit intermediate reasoning steps before predicting an answer. Third, we investigate the impact of query-conditioned source separation as a preprocessing stage and compare three inference regimes: no masking, an audio grounding model (AGM), and ground-truth masks. Our results show that reasoning amplifies the benefits of source separation, with thinking mode yielding a significant improvement of +5.1% when a single event is present in the question. These findings highlight the interplay between movement modeling, reasoning, and separation quality, offering new insights for advancing spatial audio understanding.
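The sketch below lays out the evaluation grid the third contribution describes: three masking regimes crossed with thinking mode on and off. `evaluate_grid`, `answer_fn`, `separate_fn`, and the dataset tuple layout are hypothetical stand-ins for exposition; only the regime-by-thinking grid itself comes from the abstract.

```python
import numpy as np
from itertools import product

REGIMES = ("no_mask", "agm_mask", "gt_mask")

def evaluate_grid(answer_fn, separate_fn, dataset):
    """Score every (masking regime, thinking mode) cell on a Spatial AQA set.

    answer_fn(audio, query, thinking=bool) -> predicted answer string
    separate_fn(audio, query)              -> AGM-separated target audio
    dataset: iterable of (audio, query, gt_mask, gold_answer) tuples
    """
    accuracy = {}
    for regime, thinking in product(REGIMES, (False, True)):
        correct = 0
        for audio, query, gt_mask, gold in dataset:
            if regime == "agm_mask":
                audio = separate_fn(audio, query)  # predicted, query-conditioned
            elif regime == "gt_mask":
                audio = gt_mask * audio            # oracle separation upper bound
            # With thinking=True the model emits intermediate reasoning
            # before the final answer; only the final answer is scored.
            pred = answer_fn(audio, query, thinking=thinking)
            correct += int(pred == gold)
        accuracy[(regime, thinking)] = correct / len(dataset)
    return accuracy

# Toy run with stand-in callables where a real model and AGM would plug in:
toy_set = [(np.ones(8), "Is the siren moving left?", np.ones(8), "yes")]
print(evaluate_grid(lambda a, q, thinking: "yes", lambda a, q: a, toy_set))
```

On this grid, the reported +5.1% would correspond to the thinking-on versus thinking-off contrast on single-event questions within a separation regime.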