🤖 AI Summary
Audio-visual segmentation (AVS) aims to localize and segment sounding objects in a scene, yet existing methods suffer from insufficient multimodal fusion, high computational overhead, and underutilization of multi-stage predictions. To address these issues, we propose the Progressive Confident Masking Attention Network (PMCANet). Our core innovation is a progressive confident masking attention mechanism that dynamically filters query tokens based on multi-stage prediction confidence scores, enabling efficient cross-modal semantic alignment and focused attention on salient information. Additionally, PMCANet introduces a confidence-driven cross-modal selection module, multi-stage feature distillation, and a lightweight audio-visual interaction architecture. On mainstream AVS benchmarks, PMCANet achieves state-of-the-art performance with significantly fewer parameters and FLOPs, while substantially improving inference speed. The source code is publicly available.
📝 Abstract
Audio and visual signals typically occur simultaneously, and humans possess an innate ability to correlate and synchronize information from these two modalities. Recently, a challenging problem known as Audio-Visual Segmentation (AVS) has emerged, which aims to produce segmentation maps for sounding objects within a scene. However, the methods proposed so far have neither sufficiently integrated audio and visual information nor kept computational costs manageable. Additionally, the outputs of different stages have not been fully utilized. To advance this research, we introduce a novel Progressive Confident Masking Attention Network (PMCANet). It leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames. Furthermore, we design an efficient and effective cross-attention module that enhances semantic perception by selecting query tokens. This selection is determined by confidence-driven units based on the network's multi-stage predictive outputs. Experiments demonstrate that our network outperforms other AVS methods while requiring fewer computational resources. The code is available at: https://github.com/PrettyPlate/PCMANet.
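To make the confidence-driven query selection concrete, here is a minimal NumPy sketch of one plausible reading of the idea: per-token confidence is derived from an earlier-stage prediction, and only the least-confident (most ambiguous) query tokens are forwarded to the next cross-attention stage, reducing the number of queries that must be processed. The function name, the keep ratio, and the "keep low-confidence tokens" policy are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def select_query_tokens(queries, confidence, keep_ratio=0.5):
    """Illustrative sketch: keep only the least-confident query tokens.

    queries:    (N, D) array of query token features
    confidence: (N,) per-token confidence from an earlier-stage prediction,
                e.g. |sigmoid(logit) - 0.5| * 2, in [0, 1]
    keep_ratio: fraction of tokens passed on to the next attention stage
    Returns the selected tokens and their indices (assumed policy:
    low confidence = ambiguous region, worth re-attending to).
    """
    n_keep = max(1, int(len(queries) * keep_ratio))
    idx = np.argsort(confidence)[:n_keep]  # ascending: least confident first
    return queries[idx], idx

# Toy example: 6 query tokens with 4-dim features.
rng = np.random.default_rng(0)
q = rng.standard_normal((6, 4))
conf = np.array([0.9, 0.1, 0.8, 0.2, 0.95, 0.3])
selected, idx = select_query_tokens(q, conf, keep_ratio=0.5)
print(sorted(idx.tolist()))  # the three least-confident tokens: [1, 3, 5]
```

Under this sketch, later stages attend over roughly half the tokens, which is consistent with the paper's stated goal of cutting parameters and FLOPs while focusing computation on uncertain regions.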