DDAVS: Disentangled Audio Semantics and Delayed Bidirectional Alignment for Audio-Visual Segmentation

📅 2025-12-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing audio-visual segmentation methods struggle with multi-source audio entanglement and audio-visual temporal/semantic misalignment, leading to bias toward dominant sound sources and frequent omission of weak or co-occurring ones. To address this, we propose a pixel-level sound source localization framework comprising three key components: (1) a prototype memory bank for explicit audio semantic disentanglement, modeling multi-source audio semantics; (2) a learnable delayed bidirectional cross-modal alignment mechanism to mitigate audio-visual asynchrony and semantic mismatch; and (3) learnable audio queries with dual-path delayed cross-attention, augmented by contrastive learning to enhance audio discriminability and robustness to multiple sources. Evaluated on AVS-Objects and VPO benchmarks, our method achieves state-of-the-art performance across single-source, multi-source, and multi-instance scenarios—marking the first systematic solution to the long-standing problem of weak co-occurring sound source omission.

Technology Category

Application Category

📝 Abstract
Audio-Visual Segmentation (AVS) aims to localize sound-producing objects at the pixel level by jointly leveraging auditory and visual information. However, existing methods often suffer from multi-source entanglement and audio-visual misalignment, which lead to biases toward louder or larger objects while overlooking weaker, smaller, or co-occurring sources. To address these challenges, we propose DDAVS, a Disentangled Audio Semantics and Delayed Bidirectional Alignment framework. To mitigate multi-source entanglement, DDAVS employs learnable queries to extract audio semantics and anchor them within a structured semantic space derived from an audio prototype memory bank. This is further optimized through contrastive learning to enhance discriminability and robustness. To alleviate audio-visual misalignment, DDAVS introduces dual cross-attention with delayed modality interaction, improving the robustness of multimodal alignment. Extensive experiments on the AVS-Objects and VPO benchmarks demonstrate that DDAVS consistently outperforms existing approaches, exhibiting strong performance across single-source, multi-source, and multi-instance scenarios. These results validate the effectiveness and generalization ability of our framework under challenging real-world audio-visual segmentation conditions. Project page: https://trilarflagz.github.io/DDAVS-page/
Problem

Research questions and friction points this paper is trying to address.

Mitigates multi-source audio entanglement in segmentation
Addresses audio-visual misalignment with delayed bidirectional alignment
Enhances segmentation for weaker, smaller, or co-occurring sound sources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangled audio semantics via learnable queries and prototype memory
Delayed bidirectional alignment with dual cross-attention for robustness
Contrastive learning optimizes semantic space for discriminability and robustness
🔎 Similar Papers
No similar papers found.
J
Jingqi Tian
Tsinghua Shenzhen International Graduate School, Tsinghua University
Yiheng Du
Yiheng Du
UC Berkeley
Machine LearningAI for Science
H
Haoji Zhang
Tsinghua Shenzhen International Graduate School, Tsinghua University
Y
Yuji Wang
Tsinghua Shenzhen International Graduate School, Tsinghua University
I
Isaac Ning Lee
Tsinghua Shenzhen International Graduate School, Tsinghua University
X
Xulong Bai
Tsinghua Shenzhen International Graduate School, Tsinghua University
T
Tianrui Zhu
Tsinghua Shenzhen International Graduate School, Tsinghua University
J
Jingxuan Niu
Tsinghua Shenzhen International Graduate School, Tsinghua University
Y
Yansong Tang
Tsinghua Shenzhen International Graduate School, Tsinghua University