AR2-4FV: Anchored Referring and Re-identification for Long-Term Grounding in Fixed-View Videos

📅 2026-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses linguistic reference discontinuity and unreliable re-identification in fixed-view videos, where the referent may be occluded or leave the scene for prolonged intervals. To maintain referential coherence while the target is absent, the authors construct an offline anchor library from the static background; text-aligned anchor maps then serve as persistent semantic memory. An anchor-driven re-entry prior, combined with displacement-aware cues, drives a lightweight ReID-Gating mechanism that recaptures the target efficiently, without requiring initial-frame visibility or explicit modeling of appearance dynamics. Experiments show a 10.3% improvement in re-capture rate and a 24.2% reduction in re-capture latency over the strongest baseline, and ablation studies confirm the contribution of each component.

📝 Abstract
Long-term language-guided referring in fixed-view videos is challenging: the referent may be occluded or leave the scene for long intervals and later re-enter, while framewise referring pipelines drift as re-identification (ReID) becomes unreliable. AR2-4FV leverages background stability for long-term referring. An offline Anchor Bank is distilled from static background structures; at inference, the text query is aligned with this bank to produce an Anchor Map that serves as persistent semantic memory when the referent is absent. An anchor-based re-entry prior accelerates re-capture upon return, and a lightweight ReID-Gating mechanism maintains identity continuity using displacement cues in the anchor frame. The system predicts per-frame bounding boxes without assuming the target is visible in the first frame or explicitly modeling appearance variations. AR2-4FV achieves +10.3% Re-Capture Rate (RCR) improvement and -24.2% Re-Capture Latency (RCL) reduction over the best baseline, and ablation studies further confirm the benefits of the Anchor Map, re-entry prior, and ReID-Gating.
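The abstract's recapture step (an anchor-based re-entry prior combined with ReID-Gating on displacement cues) can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: the cosine-similarity ReID score, the Gaussian re-entry prior, and the threshold values are all hypothetical placeholders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (stand-in ReID score)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def reentry_prior(candidate_xy, anchor_xy, sigma=50.0):
    """Gaussian prior centered on the anchor the text query was grounded to:
    candidates re-entering near the anchor score higher."""
    dx = candidate_xy[0] - anchor_xy[0]
    dy = candidate_xy[1] - anchor_xy[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))

def reid_gate(candidates, query_feat, anchor_xy, sim_thresh=0.5, gate_thresh=0.3):
    """Gate candidate detections cheaply by appearance similarity, then rank
    survivors by similarity weighted with the re-entry prior.
    candidates: list of (feature_vector, (x, y)) pairs.
    Returns the best candidate location, or None if nothing passes the gate."""
    best, best_score = None, gate_thresh
    for feat, xy in candidates:
        sim = cosine(query_feat, feat)
        if sim < sim_thresh:
            continue  # appearance mismatch: rejected before any spatial scoring
        score = sim * reentry_prior(xy, anchor_xy)
        if score > best_score:
            best, best_score = xy, score
    return best
```

In this toy, a candidate that matches the query features and reappears near the grounded anchor is accepted, while an appearance mismatch is discarded by the gate before the prior is ever evaluated, which is where the latency saving in such a design would come from.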
Problem

Research questions and friction points this paper is trying to address.

long-term referring
fixed-view videos
re-identification
occlusion
re-entry
Innovation

Methods, ideas, or system contributions that make the work stand out.

Anchor Map
Re-identification Gating
Long-term Referring
Fixed-view Video Grounding
Re-entry Prior
👥 Authors
Teng Yan
The Hong Kong University of Science and Technology (Guangzhou)
Yihan Liu
The Hong Kong University of Science and Technology (Guangzhou)
Jiongxu Chen
The Hong Kong University of Science and Technology (Guangzhou)
Teng Wang
The Hong Kong University of Science and Technology (Guangzhou)
Jiaqi Li
The Chinese University of Hong Kong, Shenzhen
Bingzhuo Zhong
The Hong Kong University of Science and Technology (Guangzhou)