Progressive Confident Masking Attention Network for Audio-Visual Segmentation

📅 2024-06-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Audio-visual segmentation (AVS) aims to localize and segment sounding objects in a scene, yet existing methods suffer from insufficient multimodal fusion, high computational overhead, and underutilization of multi-stage predictions. To address these issues, the authors propose the Progressive Confident Masking Attention Network (PMCANet). Its core innovation is a progressive confident masking attention mechanism that dynamically filters query tokens based on multi-stage prediction confidence scores, enabling efficient cross-modal semantic alignment and focusing attention on salient information. Additionally, PMCANet introduces a confidence-driven cross-modal selection module, multi-stage feature distillation, and a lightweight audio-visual interaction architecture. On mainstream AVS benchmarks, PMCANet achieves state-of-the-art performance with significantly fewer parameters and FLOPs, while substantially improving inference speed. The source code is publicly available.

📝 Abstract
Audio and visual signals typically occur simultaneously, and humans possess an innate ability to correlate and synchronize information from these two modalities. Recently, a challenging problem known as Audio-Visual Segmentation (AVS) has emerged, which aims to produce segmentation maps for sounding objects within a scene. However, the methods proposed so far have not sufficiently integrated audio and visual information, and their computational costs have been extremely high. Additionally, the outputs of different stages have not been fully utilized. To address these limitations, we introduce a novel Progressive Confident Masking Attention Network (PMCANet). It leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames. Furthermore, we design an efficient and effective cross-attention module to enhance semantic perception by selecting query tokens. This selection is determined through confidence-driven units based on the network's multi-stage predictive outputs. Experiments demonstrate that our network outperforms other AVS methods while requiring fewer computational resources. The code is available at: https://github.com/PrettyPlate/PCMANet.
Problem

Research questions and friction points this paper is trying to address.

Existing methods do not sufficiently integrate audio and visual signals
Computational costs of current AVS methods are extremely high
Multi-stage predictive outputs are not fully utilized
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive Confident Masking Attention Network
Cross-attention module enhances semantic perception
Confidence-driven units utilize multi-stage outputs
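
The idea of selecting query tokens by prediction confidence can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name, the confidence measure (distance of the foreground probability from 0.5), the threshold, and the residual update are all assumptions made for illustration. Under these assumptions, only visual tokens that an earlier stage predicted with low confidence are used as queries against the audio tokens, so the attention cost scales with the number of uncertain tokens rather than with all tokens.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def confidence_masked_cross_attention(visual, audio, prob, threshold=0.8):
    """Hypothetical sketch of confidence-masked cross-attention.

    visual: (N, D) visual tokens
    audio:  (M, D) audio tokens
    prob:   (N,) foreground probability per visual token from an earlier stage
    threshold: assumed cut-off; tokens with confidence below it are refined
    """
    # Assumed confidence measure: 0 at prob = 0.5, 1 at prob = 0 or 1.
    confidence = np.abs(prob - 0.5) * 2.0
    selected = confidence < threshold  # uncertain tokens become queries

    out = visual.copy()
    if selected.any():
        q = visual[selected]                              # (K, D), K <= N
        scores = q @ audio.T / np.sqrt(visual.shape[1])   # (K, M)
        attn = softmax(scores, axis=-1)
        out[selected] = q + attn @ audio                  # residual update
    return out, selected
```

Confident tokens pass through unchanged in this sketch. Whether the actual network refines uncertain or confident tokens, and how it derives confidence from its multi-stage outputs, should be checked against the paper and the released code.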
Yuxuan Wang
Department of Computing, Imperial College London; Department of Automation, BNRist, Tsinghua University
Feng Dong
Tsinghua University, School of Economics and Management
Macroeconomics, Monetary Economics, Financial Economics, Chinese Economy
Jinchao Zhu
Nankai University