Exploring Reliable Spatiotemporal Dependencies for Efficient Visual Tracking

📅 2026-01-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing lightweight visual tracking methods suffer from limited performance due to sparse sampling, which fails to adequately model spatiotemporal dependencies in videos. This work proposes STDTrack, a novel framework that, for the first time, integrates dense video sampling and robust spatiotemporal modeling into a lightweight tracker. It introduces a temporally propagated spatiotemporal token mechanism, a multi-frame information fusion module (MFIFM), and a quality-aware token maintenance strategy, complemented by a multi-scale prediction head to handle target scale variations. The proposed method achieves significant accuracy gains while maintaining high inference efficiency, attaining state-of-the-art results across six benchmarks. Notably, it delivers real-time performance with 192 FPS on GPU and 41 FPS on CPU on GOT-10k, rivaling several non-real-time, high-performance trackers.

📝 Abstract
Recent advances in transformer-based lightweight object tracking have established new standards across benchmarks, leveraging the global receptive field and powerful feature extraction capabilities of attention mechanisms. Despite these achievements, existing methods universally employ sparse sampling during training (using only one template and one search image per sequence), which fails to comprehensively explore spatiotemporal information in videos. This limitation constrains performance and causes the gap between lightweight and high-performance trackers. To bridge this divide while maintaining real-time efficiency, we propose STDTrack, a framework that pioneers the integration of reliable spatiotemporal dependencies into lightweight trackers. Our approach implements dense video sampling to maximize spatiotemporal information utilization. We introduce a temporally propagating spatiotemporal token to guide per-frame feature extraction. To ensure comprehensive target state representation, we design the Multi-frame Information Fusion Module (MFIFM), which augments current dependencies using historical context. The MFIFM operates on features stored in our constructed Spatiotemporal Token Maintainer (STM), where a quality-based update mechanism ensures information reliability. Considering the scale variation among tracking targets, we develop a multi-scale prediction head to dynamically adapt to objects of different sizes. Extensive experiments demonstrate state-of-the-art results across six benchmarks. Notably, on GOT-10k, STDTrack rivals certain high-performance non-real-time trackers (e.g., MixFormer) while operating at 192 FPS (GPU) and 41 FPS (CPU).
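To make the abstract's quality-aware token maintenance concrete, here is a minimal, hypothetical sketch of the idea: a fixed-capacity buffer of per-frame tokens where a new token is admitted only if its quality score beats the weakest stored entry, plus a quality-weighted fusion step standing in for the MFIFM. The class and method names, the buffer capacity, and the 50/50 fusion weighting are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

class SpatiotemporalTokenMaintainer:
    """Hypothetical sketch of the STM described in the abstract:
    a fixed-size buffer of per-frame feature tokens with a
    quality-based replacement policy."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.tokens = []  # list of (quality, token) pairs

    def update(self, token, quality):
        # quality in [0, 1], e.g. a predicted confidence for the frame.
        if len(self.tokens) < self.capacity:
            self.tokens.append((quality, token))
            return
        # Replace the lowest-quality entry only if the new token is better.
        worst = min(range(len(self.tokens)), key=lambda i: self.tokens[i][0])
        if quality > self.tokens[worst][0]:
            self.tokens[worst] = (quality, token)

    def fuse(self, current):
        """Quality-weighted fusion of stored historical tokens with the
        current-frame token (a simplified stand-in for the MFIFM)."""
        if not self.tokens:
            return current
        qualities = np.array([q for q, _ in self.tokens])
        history = np.stack([t for _, t in self.tokens])
        weights = qualities / qualities.sum()
        pooled = (weights[:, None] * history).sum(axis=0)
        return 0.5 * current + 0.5 * pooled  # illustrative equal blend
```

Usage: call `update` once per tracked frame with the frame's token and a confidence score, then `fuse` the stored history into the current frame's representation before prediction.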
Problem

Research questions and friction points this paper is trying to address.

visual tracking
spatiotemporal dependencies
lightweight tracker
dense sampling
performance gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatiotemporal dependency
dense video sampling
multi-frame information fusion
lightweight visual tracking
temporal token propagation
🔎 Similar Papers
No similar papers found.
Junze Shi
Key Laboratory of Opto-Electronic Information Processing, Chinese Academy of Sciences, Shenyang Institute of Automation, Chinese Academy of Sciences, University of Chinese Academy of Sciences
Yang Yu
Key Laboratory of Opto-Electronic Information Processing, Chinese Academy of Sciences, Shenyang Institute of Automation, Chinese Academy of Sciences
Jian Shi
Institute of Automation, Chinese Academy of Sciences
Computer Graphics, Computer Vision
Haibo Luo
Northeastern University
Large Language Models