An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the difficulty that large vision-language models (LVLMs) have in accurately segmenting the regions referred to by language descriptions in multi-temporal visual reasoning tasks. To this end, we introduce Multi-temporal Referring Segmentation (MTRS), a new task, and present MTRefSeg-21K, the first MTRS benchmark, which is open-source and comprises 21K high-quality samples. We propose CRAFT-Agent, an automated data construction pipeline with human auditing, and MTRefSeg-R1, a change-aware LVLM framework trained in two stages that explicitly models cross-temporal differences and aligns language expressions with temporally varying regions at the pixel level. Experiments show that MTRefSeg-R1 achieves strong and often superior performance compared with existing LVLM baselines, underscoring both the difficulty of the MTRS task and the potential of the approach.

📝 Abstract

Large Vision-Language Models (LVLMs) have shown strong visual understanding and language-guided grounding abilities, yet their capacity for multi-temporal visual reasoning remains underexplored. To bridge this gap, we introduce Multi-temporal Referring Segmentation (MTRS), a new task that aims to segment language-described temporal changes from multi-temporal images. MTRS extends conventional referring segmentation and change detection by jointly requiring temporal correspondence reasoning, language grounding, and pixel-level mask prediction. We propose CRAFT-Agent, an automated data construction pipeline with human auditing, and build MTRefSeg-21K, the first MTRS benchmark, containing 21K high-quality multi-temporal image-text-mask triplets across diverse scenes, viewpoints, and domains. Benchmarking a broad set of VLM- and LVLM-based models reveals that direct inference performs poorly, while task-specific fine-tuning remains limited. To address this, we propose MTRefSeg-R1, a change-aware LVLM framework trained with a two-stage strategy. It first learns general temporal-change perception from 20K vision-only bi-temporal samples, and is then fine-tuned on MTRefSeg-21K for fine-grained language-guided temporal localization. MTRefSeg-R1 explicitly models cross-temporal visual differences, aligns language instructions with temporal variations, and predicts referred change masks. Extensive experiments show that MTRefSeg-R1 achieves strong and often superior performance compared with existing LVLM baselines, demonstrating the challenge and potential of MTRS.
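
To make the image-text-mask triplet described above concrete, here is a minimal sketch of how one MTRS sample might be represented and how a predicted change mask could be scored. The class name `MTRSSample`, its field names, and the file paths are hypothetical and not taken from MTRefSeg-21K. Intersection-over-union is used only because it is a common segmentation measure; the abstract does not state which metrics the benchmark reports.

```python
from dataclasses import dataclass
from typing import List

# Binary mask as nested lists; 1 marks a pixel inside the referred change.
Mask = List[List[int]]


@dataclass
class MTRSSample:
    """Hypothetical container for one multi-temporal image-text-mask triplet."""
    images: List[str]   # paths to the multi-temporal images, in time order (assumed layout)
    expression: str     # language description of the temporal change to segment
    mask: Mask          # ground-truth mask of the referred change


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union between two binary masks of equal shape."""
    same_rows = len(pred) == len(target)
    if not same_rows or any(len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("masks must have the same shape")
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Convention chosen for this sketch: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


# Illustrative values only; not real benchmark data.
sample = MTRSSample(
    images=["scene_t1.png", "scene_t2.png"],
    expression="the building that appears in the second image",
    mask=[[0, 1], [1, 1]],
)
prediction = [[0, 1], [0, 1]]
score = mask_iou(prediction, sample.mask)
```

A model for this task would take `images` and `expression` as input and produce a mask like `prediction`; the scoring function then compares that mask against `sample.mask`.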

Problem

Research questions and friction points this paper is trying to address.

multi-temporal referring segmentation

temporal visual reasoning

language-guided grounding

change detection

vision-language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-temporal referring segmentation

large vision-language models

temporal change reasoning