SimToken: A Simple Baseline for Referring Audio-Visual Segmentation

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Referring Audio-Visual Segmentation (Ref-AVS) aims to localize and segment a specific object in a video based on natural language descriptions by jointly leveraging audio and visual cues, posing the dual challenges of cross-modal alignment and fine-grained spatial localization. To address these, we propose a multimodal collaborative framework: (1) a multimodal large language model (MLLM) generates semantically rich tokens encoding audio-visual-language context; (2) a target-consistent semantic alignment loss enforces representation consistency for the same entity across modalities and linguistic expressions; and (3) the aligned semantic tokens are injected into the Segment Anything Model (SAM) to enable precise frame-level segmentation. Evaluated on the Ref-AVS benchmark, our method significantly outperforms existing state-of-the-art approaches, demonstrating superior cross-modal semantic understanding and object-level localization capability.
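To make the three-step pipeline concrete, here is a minimal PyTorch sketch. The module names, tensor shapes, and the `sparse_prompt` interface are illustrative assumptions, not the authors' implementation; the MLLM and SAM backbones are hypothetical placeholders injected from outside.

```python
# Minimal sketch of a SimToken-style pipeline (hypothetical module names).
import torch
import torch.nn as nn

class SimTokenPipeline(nn.Module):
    """Wires an MLLM's special segmentation token into SAM as a prompt."""

    def __init__(self, mllm: nn.Module, sam: nn.Module,
                 token_dim: int = 4096, prompt_dim: int = 256):
        super().__init__()
        self.mllm = mllm    # multimodal LLM over audio/video/text (placeholder)
        self.sam = sam      # promptable segmentation model (placeholder)
        # Projects the MLLM's special-token embedding into SAM's prompt space.
        self.token_proj = nn.Linear(token_dim, prompt_dim)

    def forward(self, frames, audio, expression):
        # (1) The MLLM reads all modalities and emits the hidden state of a
        #     special token summarizing the referred object: (B, token_dim).
        seg_token = self.mllm(frames, audio, expression)
        # (2) Map that token into SAM's sparse-prompt space: (B, 1, prompt_dim).
        prompt = self.token_proj(seg_token).unsqueeze(1)
        # (3) Prompt SAM with the same token on every frame of the clip.
        masks = [self.sam(f, sparse_prompt=prompt) for f in frames]
        return torch.stack(masks, dim=1)  # (B, T, H, W) per-frame masks
```

The appealing design choice is that a single compact token carries all cross-modal context, so SAM itself needs no audio or language inputs.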

📝 Abstract
Referring Audio-Visual Segmentation (Ref-AVS) aims to segment specific objects in videos based on natural language expressions involving audio, vision, and text information. This task poses significant challenges in cross-modal reasoning and fine-grained object localization. In this paper, we propose a simple framework, SimToken, that integrates a multimodal large language model (MLLM) with the Segment Anything Model (SAM). The MLLM is guided to generate a special semantic token representing the referred object. This compact token, enriched with contextual information from all modalities, acts as a prompt to guide SAM to segment objects across video frames. To further improve semantic learning, we introduce a novel target-consistent semantic alignment loss that aligns token embeddings from different expressions but referring to the same object. Experiments on the Ref-AVS benchmark demonstrate that our approach achieves superior performance compared to existing methods. Code will be available at https://github.com/DianJin-HFUT/SimToken
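The abstract does not spell out the form of the target-consistent semantic alignment loss; below is a hedged sketch of one plausible InfoNCE-style formulation, where token embeddings from different expressions that refer to the same object act as positives. The function name, temperature, and batch layout are assumptions for illustration.

```python
# Assumed contrastive formulation of a target-consistent alignment loss.
import torch
import torch.nn.functional as F

def target_consistent_alignment_loss(tokens: torch.Tensor,
                                     object_ids: torch.Tensor,
                                     temperature: float = 0.07) -> torch.Tensor:
    """tokens: (N, D) semantic-token embeddings from N expressions.
    object_ids: (N,) integer id of the object each expression refers to."""
    z = F.normalize(tokens, dim=-1)
    sim = z @ z.t() / temperature                       # (N, N) similarities
    off_diag = ~torch.eye(len(z), dtype=torch.bool, device=z.device)
    # Positives: distinct expressions that refer to the same object.
    pos = (object_ids[:, None] == object_ids[None, :]) & off_diag
    # Row-wise log-softmax, excluding each token's self-similarity.
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(~off_diag, float('-inf')), dim=1, keepdim=True)
    # Average over each anchor's positives; skip anchors with no positive.
    n_pos = pos.sum(dim=1)
    valid = n_pos > 0
    loss = -(log_prob * pos).sum(dim=1)[valid] / n_pos[valid]
    return loss.mean()

# Toy usage: three objects, two expressions each.
tokens = torch.randn(6, 256)
object_ids = torch.tensor([0, 0, 1, 1, 2, 2])
print(target_consistent_alignment_loss(tokens, object_ids).item())
```

Under this formulation, pulling same-object tokens together makes the SAM prompt invariant to how the object is described, which is exactly the consistency the abstract motivates.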
Problem

Research questions and friction points this paper is trying to address.

Segmenting specific objects in videos using audio, vision, and text information
Addressing challenges in cross-modal reasoning and fine-grained object localization
Integrating multimodal language models with segmentation models for video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates multimodal LLM with Segment Anything Model
Generates semantic token to prompt object segmentation
Uses target-consistent semantic alignment loss
🔎 Similar Papers
No similar papers found.
👥 Authors

Dian Jin
HFUT, Hefei, China

Yanghao Zhou
NUS, Singapore

Jinxing Zhou
MBZUAI, Abu Dhabi, UAE

Jiaqi Ma
MBZUAI, Abu Dhabi, UAE

Ruohao Guo
Peking University
Multi-Modal Learning, Computer Vision, Video Generation

Dan Guo
IEEE Senior Member, Professor, Hefei University of Technology
Multimedia Computing, Artificial Intelligence