🤖 AI Summary
This work addresses the challenges of audio noise interference and insufficient modality interaction in audio-visual segmentation by proposing SDAVS, a novel framework comprising a Selective Noise-Resilient Processor (SNRP) and a Discriminative Audio-Visual Mutual Fusion (DAMF) strategy. SNRP leverages an attention mechanism to enhance relevant auditory cues while suppressing irrelevant noise, and DAMF integrates cross-modal alignment with discriminative fusion to construct consistent yet distinctive multimodal representations. Evaluated on mainstream benchmarks, the proposed method achieves state-of-the-art performance, demonstrating significant improvements in both accuracy and robustness for segmenting sounding objects, particularly in complex scenes involving multiple sound sources.
📝 Abstract
The ability to capture and segment sounding objects in dynamic visual scenes is crucial for the development of Audio-Visual Segmentation (AVS) tasks. While significant progress has been made in this area, the interaction between audio and visual modalities still requires further exploration. In this work, we aim to answer the following questions: How can a model effectively suppress audio noise while enhancing relevant audio information? How can we achieve discriminative interaction between the audio and visual modalities? To this end, we propose SDAVS, equipped with the Selective Noise-Resilient Processor (SNRP) module and the Discriminative Audio-Visual Mutual Fusion (DAMF) strategy. The proposed SNRP mitigates audio noise interference by selectively emphasizing relevant auditory cues, while DAMF ensures more consistent audio-visual representations. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on benchmark AVS datasets, especially in multi-source and complex scenes. *The code and model are available at https://github.com/happylife-pk/SDAVS.*
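The abstract describes SNRP as selectively emphasizing relevant auditory cues while suppressing noise. The paper's exact architecture is not given here, so the following is only a minimal sketch of one common way such selective emphasis is realized: visually-conditioned attention weights over audio frames. All function names, shapes, and the scaled-dot-product formulation are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of attention-based audio gating (not the official SNRP code).
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def selective_audio_gating(audio_feats, visual_query):
    """Re-weight per-frame audio features by their relevance to the visual scene.

    audio_feats:  (T, D) array of per-frame audio embeddings
    visual_query: (D,)   pooled visual embedding
    Returns a (T, D) array where frames irrelevant to the visual
    query receive small weights (i.e., are treated as noise).
    """
    d = audio_feats.shape[-1]
    # Scaled dot-product relevance score per audio frame.
    scores = audio_feats @ visual_query / np.sqrt(d)   # (T,)
    weights = softmax(scores, axis=0)                  # (T,), sums to 1
    # Emphasize relevant frames, attenuate the rest.
    return weights[:, None] * audio_feats

# Toy usage with random embeddings.
rng = np.random.default_rng(0)
audio = rng.standard_normal((5, 16))   # 5 audio frames, 16-dim features
visual = rng.standard_normal(16)       # pooled visual embedding
gated = selective_audio_gating(audio, visual)
print(gated.shape)
```

In this pattern, the attention weights play the role of a soft noise filter: audio frames that align poorly with the visual content contribute little to the fused representation, which is consistent with the noise-suppression behavior the abstract attributes to SNRP.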