🤖 AI Summary
To address temporal inconsistency in single-object tracking and short-term trajectory prediction under occlusion, scale variation, and temporal drift, this paper proposes SOTFormer, a lightweight constant-memory temporal Transformer that unifies detection, tracking, and short-horizon trajectory prediction. Key contributions include: (1) a ground-truth-primed memory module that enables stable identity propagation with a single temporal-attention layer; (2) a burn-in anchor loss that stabilizes initialization; and (3) an end-to-end trainable architecture integrating a fixed-size memory buffer, lightweight attention, and contrastive learning for real-time inference. Evaluated on Mini-LaSOT (20%), the method achieves 76.3 AUC and 53.7 FPS with 4.3 GB of GPU memory (under AMP), outperforming the transformer baselines TrackFormer and MOTRv2 under fast motion, scale change, and occlusion.
📝 Abstract
Accurate single-object tracking and short-term motion forecasting remain challenging under occlusion, scale variation, and temporal drift, which disrupt the temporal coherence required for real-time perception. We introduce SOTFormer, a minimal constant-memory temporal transformer that unifies object detection, tracking, and short-horizon trajectory prediction within a single end-to-end framework. Unlike prior models with recurrent or stacked temporal encoders, SOTFormer achieves stable identity propagation through a ground-truth-primed memory and a burn-in anchor loss that explicitly stabilizes initialization. A single lightweight temporal-attention layer refines embeddings across frames, enabling real-time inference with fixed GPU memory. On the Mini-LaSOT (20%) benchmark, SOTFormer attains 76.3 AUC and 53.7 FPS (AMP, 4.3 GB VRAM), outperforming transformer baselines such as TrackFormer and MOTRv2 under fast motion, scale change, and occlusion.
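To make the constant-memory idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the memory is a fixed-length FIFO buffer seeded with one ground-truth embedding, and that each frame's embedding is refined by a single scaled dot-product attention step over that buffer. The names `attend` and `ConstantMemoryTracker`, the 50/50 blend, and the toy 3-dimensional embeddings are all hypothetical; the burn-in anchor loss and the detection and prediction heads are not modeled.

```python
import math
from collections import deque


def attend(query, memory):
    """Single-head scaled dot-product attention of one query over the memory."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * m for q, m in zip(query, slot)) for slot in memory]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of the memory slots (the slots act as both keys and values).
    return [sum(w * slot[i] for w, slot in zip(weights, memory))
            for i in range(len(query))]


class ConstantMemoryTracker:
    """Hypothetical illustration: fixed-size FIFO memory plus one attention step."""

    def __init__(self, slots, primed_embedding):
        # deque(maxlen=...) evicts the oldest entry, so memory use stays constant.
        # Assumption: the buffer is primed with a ground-truth target embedding.
        self.memory = deque([list(primed_embedding)], maxlen=slots)

    def step(self, embedding):
        context = attend(embedding, self.memory)
        # Assumed blend of the current embedding with its attended context.
        refined = [0.5 * e + 0.5 * c for e, c in zip(embedding, context)]
        self.memory.append(refined)
        return refined


tracker = ConstantMemoryTracker(slots=4, primed_embedding=[1.0, 0.0, 0.0])
outputs = [tracker.step([0.9, 0.1 * t, 0.0]) for t in range(10)]
memory_size = len(tracker.memory)
```

However many frames are processed, the buffer never holds more than `slots` embeddings, which is the property that keeps per-frame cost and memory fixed in this sketch.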