🤖 AI Summary
This work addresses two challenges in hyperspectral object tracking: spectral redundancy and template mismatch in dynamic scenes. The authors propose VLHTrack, a framework that uses semantic descriptions generated by a large language model to establish a semantic-to-spectral mapping, mitigating redundancy through a Language-Guided Band Selection Module (LBSM). They also introduce a multi-modal vision-language fusion module and a Mamba-based Dynamic Template Update Module (DTUM), which uses a selective state space model to handle large variations in target appearance. Evaluated on the HOT2023 and HOT2024 benchmarks, the proposed method outperforms current state-of-the-art approaches.
📝 Abstract
Hyperspectral object tracking (HOT) leverages the rich spectral information provided by hyperspectral videos (HSVs), offering substantial potential for object tracking. However, efficiently extracting and exploiting spectral information from redundant spectral bands remains a fundamental challenge that severely limits model generalization and tracking performance. Moreover, in dynamic scenes, targets often undergo drastic appearance variations due to factors such as occlusion and illumination changes. These variations produce large deformations between the target in the current frame and the template, and such discrepancies pose major challenges for existing temporal modeling approaches. In this work, we propose VLHTrack, a novel hyperspectral vision-language (VL) joint tracking framework. Specifically, we incorporate language priors to address spectral redundancy by designing a Language-Guided Band Selection Module (LBSM). By leveraging descriptions generated by a Large Language Model (LLM), LBSM establishes a semantic-to-spectral mapping that mitigates redundancy and accentuates discriminative spectral features. A Multi-Modal Vision-Language Fusion Module then integrates visual and linguistic embeddings, exploiting their complementary strengths to learn coherent cross-modal representations. To address target deformation in long-term sequences, we propose a dynamic template feature update strategy implemented via the Dynamic Template Update with Mamba (DTUM) module. By leveraging selective state space modeling, DTUM learns inter-frame dependencies to update the template features, so that the template evolves efficiently under the guidance of temporal context. Experiments on HOT2023 and HOT2024 demonstrate that VLHTrack outperforms state-of-the-art (SOTA) methods.
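The abstract does not specify how LBSM computes its semantic-to-spectral mapping, so the following is only a minimal sketch of one plausible reading: a text embedding is scored against one learned key vector per band, the scores are normalized into band weights, and the highest-weighted bands are kept. The function names, the dot-product scoring, the top-k selection, and all shapes (16 bands, 8-dimensional embeddings) are assumptions made for illustration, not the authors' design; random arrays stand in for the LLM-derived embedding and the learned keys.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def language_guided_band_weights(text_emb, band_keys):
    """Score each spectral band against a text embedding.

    text_emb:  (d,)   embedding of the language description (assumed given)
    band_keys: (B, d) one key vector per band (hypothetical learned parameters)
    returns:   (B,)   non-negative band weights that sum to 1
    """
    scores = band_keys @ text_emb / np.sqrt(text_emb.shape[0])
    return softmax(scores)


def select_bands(cube, weights, k=3):
    """Keep the k highest-weighted bands of an (H, W, B) cube, reweighted."""
    top = np.sort(np.argsort(weights)[-k:])  # indices of the k largest weights
    return cube[..., top] * weights[top], top


# Random stand-ins for real inputs (illustrative only).
rng = np.random.default_rng(0)
num_bands, dim = 16, 8
text_emb = rng.standard_normal(dim)
band_keys = rng.standard_normal((num_bands, dim))
cube = rng.random((32, 32, num_bands))

weights = language_guided_band_weights(text_emb, band_keys)
reduced, kept = select_bands(cube, weights, k=3)
print(reduced.shape, kept)
```

Hard top-k selection is used here only because it makes the reduction in band count explicit; a trained module could equally apply the soft weights to all bands so that the selection stays differentiable.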