🤖 AI Summary
Audio-visual segmentation (AVS) suffers from high annotation costs and data scarcity. To address this, we propose AV2T-SAM—a novel framework that maps audio features into the pre-trained text embedding space of the text-prompted Segment Anything Model (SAM), enabling audio-driven segmentation without additional visual annotations. Our key contributions are: (1) a CLIP-CLAP joint cross-modal alignment mechanism that explicitly models shared audio-visual semantics; and (2) a gated fusion feature $f_{\text{CLIP}} \odot f_{\text{CLAP}}$ that effectively suppresses modality-specific noise. Crucially, AV2T-SAM requires no fine-tuning of SAM's backbone—segmentation is achieved solely through embedding-space bridging and text prompt remapping. Evaluated on AVSBench, AV2T-SAM achieves state-of-the-art performance, significantly outperforming existing audio-guided segmentation methods. This demonstrates the efficacy of cross-modal semantic alignment for weakly supervised AVS.
📝 Abstract
The goal of Audio-Visual Segmentation (AVS) is to localize and segment the sounding source objects in video frames. Research on AVS suffers from limited datasets because hand-crafted annotation is expensive. Recent works attempt to overcome this data limitation by leveraging the segmentation foundation model, SAM, prompting it with audio to enhance its ability to segment sounding source objects. While this approach alleviates the model's burden of understanding the visual modality by utilizing SAM's pre-trained knowledge, it does not address the fundamental challenge of limited data for learning audio-visual relationships. To address these limitations, we propose **AV2T-SAM**, a novel framework that bridges audio features with the text embedding space of a pre-trained text-prompted SAM. Our method leverages multimodal correspondence learned from rich text-image paired datasets to enhance audio-visual alignment. Furthermore, we introduce a novel feature, $f_{\text{CLIP}} \odot f_{\text{CLAP}}$, which emphasizes shared semantics of the audio and visual modalities while filtering irrelevant noise. Experiments demonstrate state-of-the-art performance on both subsets of the AVSBench dataset. Our approach outperforms existing methods by effectively utilizing pretrained segmentation models and cross-modal semantic alignment.
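To make the fusion feature concrete, the sketch below illustrates the element-wise product $f_{\text{CLIP}} \odot f_{\text{CLAP}}$ described above. It is a minimal, hypothetical illustration: the embedding dimension, the random stand-in vectors, and the note about a learned projection into SAM's text-prompt space are assumptions for demonstration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared embedding dimension; the real CLIP and CLAP
# encoders may differ and would need a learned projection to align.
D = 512

def l2_normalize(x):
    """Normalize a vector to unit length, as is standard for CLIP/CLAP embeddings."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for a frame embedding from a frozen CLIP image encoder
# and an audio embedding from a frozen CLAP audio encoder.
f_clip = l2_normalize(rng.standard_normal(D))  # visual semantics
f_clap = l2_normalize(rng.standard_normal(D))  # audio semantics

# Element-wise product: dimensions where both modalities agree in sign
# and magnitude are emphasized, while modality-specific activations
# (large in one embedding, near zero in the other) are suppressed.
fused = f_clip * f_clap

# In the framework described above, this fused feature would then be
# remapped (e.g., by a small trained module) into the text-prompt
# embedding space of text-prompted SAM to drive segmentation.
print(fused.shape)  # (512,)
```

The key design point is that the product acts as a soft intersection of the two semantic spaces: irrelevant noise present in only one modality is attenuated without any thresholding or extra supervision.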