Exploiting Lightweight Hierarchical ViT and Dynamic Framework for Efficient Visual Tracking

📅 2025-06-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the slow inference of Transformer-based trackers on resource-constrained devices, this paper proposes HiT, a family of lightweight hierarchical Vision Transformer trackers, and DyHiT, a dynamic tracker built on HiT. Key contributions include: (1) a Bridge Module that connects the lightweight transformer to the tracking framework and improves feature representation quality; (2) a dual-image position encoding that encodes spatial information; (3) a dynamic router that classifies tracking scenarios from backbone search-area features and selects routes with different computational cost, following a divide-and-conquer strategy; and (4) a training-free acceleration method, based on DyHiT's routing architecture, that can be applied to other trackers. HiT reaches 61 fps on NVIDIA Jetson AGX with a LaSOT AUC of 64.6%; the fastest DyHiT variant reaches 111 fps with an AUC of 62.4%, 2.2 points lower. The training-free method speeds up SeqTrack-B256 by 2.68× on an NVIDIA GeForce RTX 2080 Ti while keeping its LaSOT AUC at 69.9%.

📝 Abstract

Transformer-based visual trackers have demonstrated significant advancements due to their powerful modeling capabilities. However, their practicality is limited on resource-constrained devices because of their slow processing speeds. To address this challenge, we present HiT, a novel family of efficient tracking models that achieve high performance while maintaining fast operation across various devices. The core innovation of HiT lies in its Bridge Module, which connects lightweight transformers to the tracking framework, enhancing feature representation quality. Additionally, we introduce a dual-image position encoding approach to effectively encode spatial information. HiT achieves a speed of 61 frames per second (fps) on the NVIDIA Jetson AGX platform, alongside a competitive AUC of 64.6% on the LaSOT benchmark, outperforming all previous efficient trackers.

Building on HiT, we propose DyHiT, an efficient dynamic tracker that flexibly adapts to scene complexity by selecting routes with varying computational requirements. DyHiT feeds the search area features extracted by the backbone network into an efficient dynamic router to classify tracking scenarios. Based on the classification, DyHiT applies a divide-and-conquer strategy, selecting appropriate routes to achieve a superior trade-off between accuracy and speed. The fastest version of DyHiT achieves 111 fps on NVIDIA Jetson AGX while maintaining an AUC of 62.4% on LaSOT.

Furthermore, we introduce a training-free acceleration method based on the dynamic routing architecture of DyHiT. This method significantly improves the execution speed of various high-performance trackers without sacrificing accuracy. For instance, our acceleration method enables the state-of-the-art tracker SeqTrack-B256 to achieve a 2.68 times speedup on an NVIDIA GeForce RTX 2080 Ti GPU while maintaining the same AUC of 69.9% on LaSOT.
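The routing idea described in the abstract can be illustrated with a minimal sketch. This is not the DyHiT implementation: the linear-plus-sigmoid router, the names `router_score` and `track_frame`, the threshold, and the two route callables are all hypothetical stand-ins, chosen only to show how a per-frame score computed from search-area features might select between a cheap route and a full route.

```python
import math
from typing import Callable, Sequence, Tuple


def router_score(features: Sequence[float],
                 weights: Sequence[float],
                 bias: float) -> float:
    """Hypothetical router: linear layer + sigmoid giving an 'easy scene' score in (0, 1)."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def track_frame(features: Sequence[float],
                weights: Sequence[float],
                bias: float,
                threshold: float,
                fast_route: Callable[[Sequence[float]], object],
                full_route: Callable[[Sequence[float]], object]) -> Tuple[str, object]:
    """Pick a route for one frame based on the router score (assumed decision rule)."""
    score = router_score(features, weights, bias)
    if score >= threshold:
        # Scene classified as easy: take the cheaper route and skip the rest.
        return "fast", fast_route(features)
    # Scene classified as hard: run the full, more expensive route.
    return "full", full_route(features)


if __name__ == "__main__":
    # Placeholder routes; a real tracker would run prediction heads here.
    fast = lambda f: "box from shallow head"
    full = lambda f: "box from deep head"
    print(track_frame([1.0, 2.0], [0.5, 0.5], 0.0, 0.5, fast, full))
    print(track_frame([-1.0, -2.0], [0.5, 0.5], 0.0, 0.5, fast, full))
```

In a sketch like this, the threshold is the knob that trades accuracy for speed: raising it sends more frames down the full route.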

Problem

Research questions and friction points this paper is trying to address.

Improving visual tracking speed on resource-limited devices

Enhancing feature representation with lightweight hierarchical ViT

Dynamic routing for adaptive accuracy-speed trade-offs
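To make the accuracy-speed trade-off concrete, the following sketch estimates average throughput when frames are split between two routes. It is a back-of-the-envelope model, not a result from the paper: the function name `mixed_fps` is hypothetical, router overhead is ignored, and the 111 fps and 61 fps figures are borrowed from the abstract only as illustrative per-route speeds (the paper does not state that they correspond to individual routes).

```python
def mixed_fps(fast_fps: float, full_fps: float, fast_fraction: float) -> float:
    """Average fps when `fast_fraction` of frames take the fast route.

    Per-frame times add, so the result is a weighted harmonic mean
    of the two speeds, not an arithmetic mean.
    """
    if not 0.0 <= fast_fraction <= 1.0:
        raise ValueError("fast_fraction must be in [0, 1]")
    avg_seconds_per_frame = (fast_fraction / fast_fps
                             + (1.0 - fast_fraction) / full_fps)
    return 1.0 / avg_seconds_per_frame


if __name__ == "__main__":
    # Illustrative only: half the frames on each route.
    print(round(mixed_fps(111.0, 61.0, 0.5), 1))  # 78.7
```

Because slow frames dominate total time, sending half the frames down the fast route gives about 78.7 fps under these assumptions, less than the arithmetic midpoint of 86 fps.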

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight hierarchical ViT for efficient tracking

Dynamic framework adapts to scene complexity

Training-free acceleration method boosts speed
