DMTrack: Spatio-Temporal Multimodal Tracking via Dual-Adapter

📅 2025-08-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenges of cross-modal feature alignment and inefficient fusion in spatio-temporal multimodal tracking, this paper proposes a lightweight dual-adapter architecture. It comprises a spatio-temporal modality adapter and a progressive modality-complementary adapter, jointly guided by a self-prompting mechanism to facilitate cross-modal interaction. Efficient information flow is achieved through pixel-level shallow parameter sharing and deep-layer attention modulation. With only 0.93M trainable parameters, and with the backbone network kept frozen, the architecture effectively performs cross-modal feature alignment and fusion. Evaluated on five mainstream benchmarks, it achieves state-of-the-art (SOTA) performance, striking a strong balance between accuracy, parameter efficiency, and computational cost.
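The "pixel-level shallow sharing" idea in the summary can be illustrated with a minimal pure-Python sketch (not the authors' code): one shared set of adapter weights transforms both modality streams, so either branch's gradients would update the same parameters. All names, dimensions, and values below are illustrative assumptions.

```python
# Hypothetical shallow adapter shared across two modality branches.
# A residual per-feature linear map: out_i = x_i + sum_j W[i][j] * x_j.
def shared_shallow_adapter(x, W):
    return [xi + sum(w * xj for w, xj in zip(row, x))
            for xi, row in zip(x, W)]

W_shared = [[0.1, 0.0],
            [0.0, 0.1]]          # ONE parameter set, shared by both branches

rgb_feat = [1.0, 2.0]            # toy RGB feature vector
tir_feat = [0.5, -1.0]           # toy auxiliary-modality (e.g. thermal) vector

rgb_out = shared_shallow_adapter(rgb_feat, W_shared)  # RGB branch uses W_shared
tir_out = shared_shallow_adapter(tir_feat, W_shared)  # so does the other branch
```

Because both branches pass through the same weights, the shared adapter acts as a common channel that bridges the two modality streams before any deeper fusion.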

📝 Abstract
In this paper, we explore adapter tuning and introduce a novel dual-adapter architecture for spatio-temporal multimodal tracking, dubbed DMTrack. The key to our DMTrack lies in two simple yet effective modules: a spatio-temporal modality adapter (STMA) and a progressive modality complementary adapter (PMCA). The former, applied to each modality alone, aims to adjust spatio-temporal features extracted from a frozen backbone by self-prompting, which to some extent bridges the gap between different modalities and thus allows better cross-modality fusion. The latter seeks to facilitate cross-modality prompting progressively with two specially designed pixel-wise shallow and deep adapters. The shallow adapter employs parameters shared between the two modalities, aiming to bridge the information flow between the two modality branches and thereby lay the foundation for the subsequent modality fusion, while the deep adapter modulates the preliminarily fused information flow with pixel-wise inner-modal attention and further generates modality-aware prompts through pixel-wise inter-modal attention. With such designs, DMTrack achieves promising spatio-temporal multimodal tracking performance with merely 0.93M trainable parameters. Extensive experiments on five benchmarks show that DMTrack achieves state-of-the-art results. Code will be available.
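The abstract's headline number (0.93M trainable parameters with a frozen backbone) reflects the standard bottleneck-adapter trade-off: only small down-/up-projection matrices are trained. A minimal pure-Python sketch of that idea, with illustrative dimensions that are assumptions and not DMTrack's actual configuration:

```python
def adapter_params(d_model: int, r: int) -> int:
    """Trainable parameters of one residual bottleneck adapter:
    down-projection (d_model*r weights + r biases) plus
    up-projection (r*d_model weights + d_model biases)."""
    return d_model * r + r + r * d_model + d_model

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def adapter_forward(x, W_down, W_up):
    """Residual bottleneck adapter: x + W_up @ relu(W_down @ x).
    The frozen backbone feature x is only nudged by the small branch."""
    y = matvec(W_up, relu(matvec(W_down, x)))
    return [xi + yi for xi, yi in zip(x, y)]

# Example: with an assumed d_model=768 and bottleneck r=8, one adapter
# costs about 13K parameters, so even dozens stay well under 1M total.
print(adapter_params(768, 8))
```

The PMCA's pixel-wise shallow/deep adapters are more elaborate than this plain bottleneck, but the same principle explains why the whole design stays at sub-million trainable parameters.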
Problem

Research questions and friction points this paper is trying to address.

Develops dual-adapter architecture for multimodal tracking
Enhances cross-modality fusion via spatio-temporal feature adjustment
Achieves efficient tracking with minimal trainable parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-adapter architecture for multimodal tracking
Spatio-temporal modality adapter for feature adjustment
Progressive modality complementary adapter for cross-modality fusion
👥 Authors

Weihong Li: Hangzhou Institute for Advanced Study; Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Shaohua Dong: University of North Texas
Haonan Lu: OPPO Research Institute
Yanhao Zhang: OPPO Research Institute
Heng Fan: Assistant Professor, University of North Texas
Libo Zhang: Institute of Software, Chinese Academy of Sciences