🤖 AI Summary
Audio-visual segmentation (AVS) suffers from high annotation costs and data scarcity. To address this, we propose AV2T-SAM—a novel framework that maps audio features into the pre-trained text embedding space of the text-prompted Segment Anything Model (SAM), enabling audio-driven segmentation without additional visual annotations. Our key contributions are: (1) a CLIP-CLAP joint cross-modal alignment mechanism that explicitly models shared audio-visual semantics; and (2) a gated fusion feature $f_{\text{CLIP}} \odot f_{\text{CLAP}}$ that effectively suppresses modality-specific noise. Crucially, AV2T-SAM requires no fine-tuning of SAM's backbone—segmentation is achieved solely through embedding-space bridging and text prompt remapping. Evaluated on AVSBench, AV2T-SAM achieves state-of-the-art performance, significantly outperforming existing audio-guided segmentation methods. This demonstrates the efficacy of cross-modal semantic alignment for weakly supervised AVS.
📝 Abstract
The goal of Audio-Visual Segmentation (AVS) is to localize and segment the sounding source objects in video frames. Research on AVS suffers from limited datasets because hand-crafted annotation is expensive. Recent works attempt to overcome this data limitation by leveraging the segmentation foundation model, SAM, prompting it with audio to enhance its ability to segment sounding source objects. While this approach alleviates the model's burden of understanding the visual modality by utilizing SAM's pre-trained knowledge, it does not address the fundamental challenge of limited data for learning audio-visual relationships. To address these limitations, we propose **AV2T-SAM**, a novel framework that bridges audio features with the text embedding space of a pre-trained text-prompted SAM. Our method leverages multimodal correspondence learned from rich text-image paired datasets to enhance audio-visual alignment. Furthermore, we introduce a novel feature, $f_{\text{CLIP}} \odot f_{\text{CLAP}}$, which emphasizes shared semantics of the audio and visual modalities while filtering irrelevant noise. Experiments demonstrate state-of-the-art performance on both subsets of the AVSBench dataset. Our approach outperforms existing methods by effectively utilizing pretrained segmentation models and cross-modal semantic alignment.
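To make the fusion feature concrete, the sketch below illustrates the element-wise product $f_{\text{CLIP}} \odot f_{\text{CLAP}}$ described above. It is a minimal, hypothetical illustration: the embedding dimension, the random stand-in vectors, and the note about a learned projection into SAM's text-prompt space are assumptions for demonstration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared embedding dimension; the real CLIP and CLAP
# encoders may differ and would need a learned projection to align.
D = 512

def l2_normalize(x):
    """Normalize a vector to unit length, as is standard for CLIP/CLAP embeddings."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for a frame embedding from a frozen CLIP image encoder
# and an audio embedding from a frozen CLAP audio encoder.
f_clip = l2_normalize(rng.standard_normal(D))  # visual semantics
f_clap = l2_normalize(rng.standard_normal(D))  # audio semantics

# Element-wise product: dimensions where both modalities agree in sign
# and magnitude are emphasized, while modality-specific activations
# (large in one embedding, near zero in the other) are suppressed.
fused = f_clip * f_clap

# In the framework described above, this fused feature would then be
# remapped (e.g., by a small trained module) into the text-prompt
# embedding space of text-prompted SAM to drive segmentation.
print(fused.shape)  # (512,)
```

The key design point is that the product acts as a soft intersection of the two semantic spaces: irrelevant noise present in only one modality is attenuated without any thresholding or extra supervision.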