🤖 AI Summary
This work addresses a key limitation of existing multi-object tracking methods, which focus on geometric localization and identity association while neglecting semantic understanding of object behaviors, i.e., the "what" and "why" behind their actions. To bridge this gap, the authors propose LLMTrack, the first end-to-end framework that integrates a multimodal large language model into semantic multi-object tracking. By decoupling localization from comprehension, LLMTrack pairs Grounding DINO for detection with LLaVA-OneVision for reasoning, and introduces a spatio-temporal fusion module to model complex trajectories. A LoRA-based three-stage progressive training strategy (visual alignment, temporal fine-tuning, and semantic injection) unifies geometric perception with cognitive reasoning. Evaluated on the BenSMOT benchmark, LLMTrack achieves state-of-the-art performance, significantly outperforming existing approaches in instance description, interaction recognition, and video summarization while maintaining robust tracking accuracy.
📝 Abstract
Traditional Multi-Object Tracking (MOT) systems have achieved remarkable precision in localization and association, effectively answering *where* and *who*. However, they often act as purely geometric observers, able to trace object paths yet blind to the semantic *what* and *why* behind object behaviors. To bridge the gap between geometric perception and cognitive reasoning, we propose **LLMTrack**, a novel end-to-end framework for Semantic Multi-Object Tracking (SMOT). We adopt a bionic design philosophy that decouples robust localization from deep understanding, employing Grounding DINO as the "eyes" and the LLaVA-OneVision multimodal large language model as the "brain". We introduce a Spatio-Temporal Fusion Module that aggregates instance-level interaction features and video-level context, enabling the Large Language Model (LLM) to comprehend complex trajectories. Furthermore, we design a progressive three-stage training strategy (Visual Alignment, Temporal Fine-tuning, and Semantic Injection via LoRA) to efficiently adapt the massive model to the tracking domain. Extensive experiments on the BenSMOT benchmark demonstrate that LLMTrack achieves state-of-the-art performance, significantly outperforming existing methods in instance description, interaction recognition, and video summarization while maintaining robust tracking stability.
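The decoupled "eyes/brain" pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the detector, the fusion rule (simple mean pooling over time stands in for the Spatio-Temporal Fusion Module), and the LLM interface are all hypothetical stubs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class Track:
    """One tracked instance: its per-frame boxes and appearance features."""
    track_id: int
    boxes: List[Box] = field(default_factory=list)
    features: List[List[float]] = field(default_factory=list)

def fuse_spatiotemporal(track: Track) -> List[float]:
    """Hypothetical stand-in for the Spatio-Temporal Fusion Module:
    mean-pool the instance's features over time into one trajectory vector."""
    n = len(track.features)
    dim = len(track.features[0])
    return [sum(f[d] for f in track.features) / n for d in range(dim)]

def semantic_track(frames, detect: Callable, llm: Callable) -> dict:
    """Decoupled pipeline: the 'eyes' localize instances per frame,
    the 'brain' then describes each fused trajectory."""
    tracks: dict = {}
    for frame in frames:
        # eyes: per-frame detection/association (e.g. a Grounding DINO wrapper)
        for track_id, box, feat in detect(frame):
            trk = tracks.setdefault(track_id, Track(track_id))
            trk.boxes.append(box)
            trk.features.append(feat)
    # brain: one semantic description per fused trajectory
    return {tid: llm(fuse_spatiotemporal(trk)) for tid, trk in tracks.items()}

# Toy usage with stub detector/LLM:
frames = ["f0", "f1"]
detect = lambda f: [(1, (0, 0, 10, 10), [1.0, 2.0] if f == "f0" else [3.0, 4.0])]
llm = lambda fused: f"instance with pooled feature {fused}"
print(semantic_track(frames, detect, llm))
```

The point of the sketch is the interface boundary: localization produces geometric tracks, fusion compresses each track into a fixed-size representation, and only that compressed trajectory reaches the language model.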