Online Reasoning Video Object Segmentation

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes the Online Reasoning Video Object Segmentation (ORVOS) task: per-frame, pixel-level segmentation from natural-language queries that may contain implicit and temporally grounded references, under strict causal constraints that allow access only to past and current frames. To handle events that evolve and referents that shift, the study introduces ORVOSB, a benchmark with frame-level causal annotations and referent-shift labels, and a baseline that combines a structured temporal token reservoir with continually updated segmentation prompts for long-horizon reasoning under bounded computation. Experiments show that existing methods struggle under strict causality and referent shifts, while the proposed baseline provides a strong starting point on the new benchmark.

📝 Abstract

Reasoning video object segmentation predicts pixel-level masks in videos from natural-language queries that may involve implicit and temporally grounded references. However, existing methods are developed and evaluated in an offline regime, where the entire video is available at inference time and future frames can be exploited for retrospective disambiguation, deviating from real-world deployments that require strictly causal, frame-by-frame decisions. We study Online Reasoning Video Object Segmentation (ORVOS), where models must incrementally interpret queries using only past and current frames without revisiting previous predictions, while handling referent shifts as events unfold. To support evaluation, we introduce ORVOSB, a benchmark with frame-level causal annotations and referent-shift labels, comprising 210 videos, 12,907 annotated frames, and 512 queries across five reasoning categories. We further propose a baseline with continually-updated segmentation prompts and a structured temporal token reservoir for long-horizon reasoning under bounded computation. Experiments show that existing methods struggle under strict causality and referent shifts, while our baseline establishes a strong foundation for future research.
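The strictly causal, bounded-memory setting described in the abstract can be illustrated with a small sketch. This is not the paper's baseline: the listing does not describe how the temporal token reservoir is structured, so the example assumes a plain first-in, first-out buffer, and every name in it (`run_online`, `mean_token`, `above_history`, `capacity`) is hypothetical. It only shows the two constraints: each prediction sees the current frame plus a bounded summary of earlier frames, and emitted predictions are never revisited.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Tuple

Frame = List[float]  # hypothetical stand-in for per-frame features
Token = float        # hypothetical stand-in for a compact memory token


def run_online(
    frames: Iterable[Frame],
    summarize: Callable[[Frame], Token],
    predict: Callable[[Frame, Tuple[Token, ...]], int],
    capacity: int = 4,
) -> Iterator[int]:
    """Yield one prediction per frame using only past and current frames."""
    # Assumption: FIFO eviction. The oldest token is dropped once capacity is reached.
    memory: Deque[Token] = deque(maxlen=capacity)
    for frame in frames:
        # The prediction sees the current frame and tokens from earlier frames only.
        yield predict(frame, tuple(memory))
        # Memory is updated after predicting; earlier outputs are never revised.
        memory.append(summarize(frame))


def mean_token(frame: Frame) -> Token:
    """Toy summary: the mean of the frame's values."""
    return sum(frame) / len(frame)


def above_history(frame: Frame, memory: Tuple[Token, ...]) -> int:
    """Toy decision: 1 if the current frame exceeds the remembered average."""
    if not memory:
        return 0
    return int(mean_token(frame) > sum(memory) / len(memory))


video = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5], [3.0, 3.0]]
outputs = list(run_online(video, mean_token, above_history, capacity=2))
print(outputs)
```

Because the loop never looks ahead, running it on any prefix of `video` gives the same predictions as the corresponding prefix of the full run, which is the property an offline method that uses future frames would not have.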

Problem

Research questions and friction points this paper is trying to address.

Online Reasoning Video Object Segmentation

strictly causal (online) inference

referent shift

temporal grounding

frame-by-frame segmentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Online Reasoning Video Object Segmentation

strictly causal (online) inference

referent shift