AgentRVOS: Reasoning over Object Tracks for Zero-Shot Referring Video Object Segmentation

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing zero-shot referring video object segmentation (RVOS) methods suffer from limited performance due to the absence of object-level temporal evidence during inference. This work proposes the first training-free agent framework that integrates object-level mask trajectories into a zero-shot RVOS pipeline. Specifically, it first leverages SAM3 to generate spatiotemporally coherent object trajectories across the entire video, then employs a multimodal large language model (MLLM) to perform semantic reasoning and iterative pruning of these trajectories based on linguistic queries. By effectively combining SAM3’s video perception capabilities with the MLLM’s semantic understanding, the method achieves state-of-the-art performance among training-free approaches across multiple benchmarks and demonstrates robustness across different MLLM backbones.

📝 Abstract
Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training-free methods for this task follow a common pipeline: an MLLM selects keyframes, grounds the referred object within those frames, and a video segmentation model propagates the results. While intuitive, this design asks the MLLM to make temporal decisions before any object-level evidence is available, limiting both reasoning quality and spatio-temporal coverage. To overcome this, we propose AgentRVOS, a training-free agentic pipeline built on the complementary strengths of SAM3 and an MLLM. Given a concept derived from the query, SAM3 provides reliable perception over the full spatio-temporal extent through generated mask tracks. The MLLM then identifies the target through query-grounded reasoning over this object-level evidence, iteratively pruning candidates guided by SAM3's temporal existence information. Extensive experiments show that AgentRVOS achieves state-of-the-art performance among training-free methods across multiple benchmarks, with consistent results across diverse MLLM backbones. Our project page is available at: https://cvlab-kaist.github.io/AgentRVOS/.
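The track-then-prune loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate_tracks` and `score_track` are hypothetical stand-ins for SAM3's mask-track generation and the MLLM's query-grounded scoring, and all names and the toy scoring heuristic are assumptions.

```python
# Hypothetical sketch of an AgentRVOS-style track-then-prune loop.
# generate_tracks / score_track are stand-ins for SAM3 and the MLLM;
# neither reflects the paper's actual interfaces.

from dataclasses import dataclass, field
from typing import List


@dataclass
class MaskTrack:
    track_id: int
    frames_present: List[int] = field(default_factory=list)  # frames where the object exists
    score: float = 0.0  # MLLM relevance score for the query


def generate_tracks(video_frames: int) -> List[MaskTrack]:
    """Stand-in for SAM3: propose spatio-temporally coherent mask tracks."""
    # Toy tracks with decreasing temporal coverage.
    return [MaskTrack(i, list(range(0, video_frames, i + 1))) for i in range(3)]


def score_track(track: MaskTrack, query: str) -> float:
    """Stand-in for MLLM reasoning: score how well a track matches the query."""
    # Toy heuristic: prefer tracks present in more frames
    # (a real system would reason over crops and the language query).
    return float(len(track.frames_present))


def agent_rvos(video_frames: int, query: str, rounds: int = 2) -> MaskTrack:
    """Iteratively re-score candidate tracks and prune the weaker half."""
    tracks = generate_tracks(video_frames)
    for _ in range(rounds):
        for t in tracks:
            t.score = score_track(t, query)
        tracks.sort(key=lambda t: t.score, reverse=True)
        # Iterative pruning guided by temporal evidence: keep the stronger half.
        tracks = tracks[: max(1, len(tracks) // 2)]
    return tracks[0]


best = agent_rvos(video_frames=12, query="the dog running on the left")
print(best.track_id)  # → 0 (the track covering all 12 frames survives pruning)
```

The key design point the sketch captures is ordering: object-level temporal evidence (full-video tracks) exists *before* any language reasoning happens, so the MLLM only ranks and prunes candidates rather than making blind temporal decisions.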
Problem

Research questions and friction points this paper is trying to address.

Referring Video Object Segmentation
Zero-Shot
Training-Free
Temporal Reasoning
Object Tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

AgentRVOS
zero-shot referring video object segmentation
training-free
mask tracks
multimodal large language model