Vision-Language Guided Hyperspectral Object Tracking via Semantics Fusion and Contextual Template Updating

📅 2026-06-08
🤖 AI Summary
This work addresses two challenges in hyperspectral object tracking: spectral redundancy and template mismatch in dynamic scenes. The authors propose VLHTrack, a framework that uses semantic descriptions generated by large language models to establish a semantic-to-spectral mapping, mitigating redundancy through a Language-Guided Band Selection Module (LBSM). They also introduce a vision-language multimodal fusion mechanism and a Dynamic Template Update with Mamba (DTUM) module, which uses a selective state space model to handle large target appearance variations. On the HOT2023 and HOT2024 benchmarks, the method is reported to outperform current state-of-the-art approaches.
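The band-selection idea summarized above can be illustrated with a minimal sketch. It is not the authors' LBSM: the per-band descriptors, the text embedding, the cosine-similarity scoring, the temperature, and the top-k rule are all assumptions chosen only to show how a language prior could weight spectral bands.

```python
import numpy as np

def language_guided_band_weights(band_feats, text_emb, temperature=0.1):
    """Score each spectral band by cosine similarity to a text embedding.

    band_feats: (num_bands, dim) hypothetical per-band descriptors.
    text_emb:   (dim,) hypothetical embedding of an LLM-written description.
    Returns softmax weights over bands (non-negative, summing to 1).
    """
    b = band_feats / np.linalg.norm(band_feats, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    logits = (b @ t) / temperature
    logits = logits - logits.max()  # stabilize the softmax
    w = np.exp(logits)
    return w / w.sum()

def select_bands(cube, weights, k):
    """Keep the k highest-weighted bands of an (H, W, num_bands) cube."""
    idx = np.sort(np.argsort(weights)[-k:])  # top-k indices, in band order
    return cube[..., idx] * weights[idx], idx

# Synthetic data only: a 16-band cube and a description that resembles band 5.
rng = np.random.default_rng(0)
cube = rng.random((8, 8, 16))
band_feats = rng.standard_normal((16, 32))
text_emb = band_feats[5] + 0.1 * rng.standard_normal(32)

weights = language_guided_band_weights(band_feats, text_emb)
reduced, kept = select_bands(cube, weights, k=3)
```

In this toy setup the band whose descriptor best matches the text receives the largest weight, and the cube shrinks from 16 bands to 3. A trained module would learn the mapping rather than rely on raw cosine similarity.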
📝 Abstract
Hyperspectral object tracking (HOT) leverages the rich spectral information provided by hyperspectral videos (HSVs), offering substantial potential for object tracking. However, efficiently extracting and exploiting spectral information from redundant spectral bands remains a fundamental challenge, which severely limits model generalization and tracking performance. Moreover, in dynamic scenes, targets often undergo drastic appearance variations due to factors such as occlusion and illumination changes. These variations produce large discrepancies between the current frame and the template, which pose major challenges for existing temporal modeling approaches. In this work, we propose VLHTrack, a novel hyperspectral vision-language (VL) joint tracking framework. Specifically, we incorporate language priors to address spectral redundancy by designing a Language-Guided Band Selection Module (LBSM). By leveraging Large Language Model (LLM) descriptions, LBSM establishes a semantic-to-spectral mapping that mitigates redundancy and accentuates discriminative spectral features. A Multi-Modal Vision-Language Fusion Module is then employed to integrate visual and linguistic embeddings, harnessing their complementary advantages to learn coherent cross-modal representations. To address target deformation in long-term sequences, we propose a dynamic template feature update strategy implemented via the Dynamic Template Update with Mamba (DTUM) module. By leveraging selective state space modeling, DTUM learns inter-frame dependencies to update the template features, so that they evolve efficiently under the guidance of temporal context. Experiments on HOT2023 and HOT2024 demonstrate that VLHTrack outperforms state-of-the-art (SOTA) methods.
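The template-update idea in the abstract can be sketched as an input-dependent recurrence. This is a heavily simplified stand-in for a selective state space model, not Mamba and not the authors' DTUM: it uses one diagonal state per feature channel, and the names `w_delta`, `b_delta`, and `a_diag` are hypothetical parameters introduced for illustration.

```python
import numpy as np

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return np.logaddexp(0.0, z)

def update_template(template, frame_feats, w_delta, b_delta, a_diag):
    """Evolve a template feature over a sequence of frame features.

    template:    (dim,) initial template feature.
    frame_feats: (T, dim) per-frame target features.
    w_delta:     (dim, dim) and b_delta: (dim,), hypothetical step-size projection.
    a_diag:      (dim,) negative diagonal state matrix (assumed).
    """
    h = np.asarray(template, dtype=float).copy()
    for x in frame_feats:
        # The step size depends on the current input: this is the "selective" part.
        delta = softplus(x @ w_delta + b_delta)
        # Per-channel decay in (0, 1), since delta > 0 and a_diag < 0.
        decay = np.exp(delta * a_diag)
        # Convex blend of the old state and the new observation.
        h = decay * h + (1.0 - decay) * x
    return h

# Synthetic data only.
rng = np.random.default_rng(0)
dim = 4
template = rng.random(dim)
frames = rng.random((5, dim))
w_delta = 0.1 * rng.standard_normal((dim, dim))
b_delta = np.zeros(dim)
a_diag = -np.ones(dim)

updated = update_template(template, frames, w_delta, b_delta, a_diag)
```

Because each step is a convex blend, the updated template stays within the per-channel range of the initial template and the observed frames, and it is left unchanged when every frame equals the template. A learned model would train the projection so that unreliable frames (for example, occluded ones) yield small steps.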
Problem

Research questions and friction points this paper is trying to address.

Hyperspectral Object Tracking
Spectral Redundancy
Target Deformation
Dynamic Scenes
Template Updating
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Fusion
Language-Guided Band Selection
Dynamic Template Update
Mamba-based Modeling
Hyperspectral Object Tracking
🔎 Similar Papers
Rui Yao
China University of Mining and Technology
Computer Vision, Machine Learning
Yuhong Zhang
School of Computer Science and Technology / School of Artificial Intelligence, and Mine Digitization Engineering Research Center of the Ministry of Education, and Jiangsu Provincial Industrial Technology Engineering Center for Intelligent Sensing and Emergency IoT in Underground Space, China University of Mining and Technology, Xuzhou 221116, China
Kunyang Sun
School of Computer Science and Technology / School of Artificial Intelligence, and Mine Digitization Engineering Research Center of the Ministry of Education, and Jiangsu Provincial Industrial Technology Engineering Center for Intelligent Sensing and Emergency IoT in Underground Space, China University of Mining and Technology, Xuzhou 221116, China
Hancheng Zhu
School of Computer Science and Technology / School of Artificial Intelligence, and Mine Digitization Engineering Research Center of the Ministry of Education, and Jiangsu Provincial Industrial Technology Engineering Center for Intelligent Sensing and Emergency IoT in Underground Space, China University of Mining and Technology, Xuzhou 221116, China
Jiaqi Zhao
Xidian University
privacy-preserving machine learning
Zhiwen Shao
School of Computer Science and Technology / School of Artificial Intelligence, and Mine Digitization Engineering Research Center of the Ministry of Education, and Jiangsu Provincial Industrial Technology Engineering Center for Intelligent Sensing and Emergency IoT in Underground Space, China University of Mining and Technology, Xuzhou 221116, China
Abdulmotaleb El Saddik
MCRLab, University of Ottawa
Immersive Media, Digital Twins, Human Centered AI, Multimedia Communication, Metaverse