LLMTrack: Semantic Multi-Object Tracking with Multi-modal Large Language Models

📅 2026-01-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of existing multi-object tracking methods, which focus primarily on geometric localization and identity association while neglecting semantic understanding of object behaviors—specifically, the “what” and “why” behind their actions. To bridge this gap, we propose LLMTrack, the first end-to-end framework that integrates multimodal large language models into semantic multi-object tracking. By decoupling localization from comprehension, LLMTrack synergistically combines Grounding DINO and LLaVA-OneVision and introduces a spatiotemporal fusion module to model complex trajectories. A LoRA-based three-stage progressive training strategy—encompassing visual alignment, temporal fine-tuning, and semantic injection—unifies geometric perception with cognitive reasoning. Evaluated on the BenSMOT benchmark, LLMTrack achieves state-of-the-art performance, significantly outperforming existing approaches in instance description, interaction recognition, and video summarization, while maintaining robust tracking accuracy.
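The summary's LoRA-based training strategy adapts the large model by adding trainable low-rank updates to frozen weight matrices. The sketch below is only an illustration of the generic LoRA update y = x(W + (α/r)·AB), written in plain Python; the matrices, rank, and function names are hypothetical and not taken from the paper's code.

```python
# Minimal LoRA-style low-rank update sketch (illustrative, not the paper's code).
# W: frozen d_in x d_out weight; A: d_in x r, B: r x d_out trainable factors.
# B is conventionally initialized to zero, so training starts from the frozen layer.

def matmul(X, Y):
    """Plain-Python matrix multiply for small dense matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def madd(X, Y):
    """Elementwise matrix addition."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mscale(X, s):
    """Multiply every entry of X by scalar s."""
    return [[x * s for x in row] for row in X]

def lora_forward(x, W, A, B, alpha=1.0):
    """y = x @ (W + (alpha/r) * A @ B), with rank r = number of rows of B."""
    r = len(B)
    delta = mscale(matmul(A, B), alpha / r)
    return matmul(x, madd(W, delta))
```

With B at zero the adapted layer reproduces the frozen layer exactly, which is why LoRA fine-tuning can start from the pretrained model without disturbing it.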

📝 Abstract
Traditional Multi-Object Tracking (MOT) systems have achieved remarkable precision in localization and association, effectively answering "where" and "who". However, they often act as mute observers: capable of tracing geometric paths but blind to the semantic "what" and "why" behind object behaviors. To bridge the gap between geometric perception and cognitive reasoning, we propose LLMTrack, a novel end-to-end framework for Semantic Multi-Object Tracking (SMOT). We adopt a bionic design philosophy that decouples strong localization from deep understanding, using Grounding DINO as the "eyes" and the LLaVA-OneVision multimodal large model as the "brain". We introduce a Spatio-Temporal Fusion Module that aggregates instance-level interaction features and video-level context, enabling the Large Language Model (LLM) to comprehend complex trajectories. Furthermore, we design a progressive three-stage training strategy (Visual Alignment, Temporal Fine-tuning, and Semantic Injection via LoRA) to efficiently adapt the massive model to the tracking domain. Extensive experiments on the BenSMOT benchmark demonstrate that LLMTrack achieves state-of-the-art performance, significantly outperforming existing methods in instance description, interaction recognition, and video summarization while maintaining robust tracking stability.
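The abstract's decoupled design first localizes and associates objects, then hands consistent trajectories to the language model for comprehension. The sketch below illustrates only the generic localize-then-associate half of that pipeline with a greedy IoU tracker; all names here (`Track`, `associate`) are placeholders, and the real system uses Grounding DINO detections rather than hand-written boxes.

```python
# Illustrative sketch of the "localize, then associate" half of a decoupled
# tracking pipeline (hypothetical, not the paper's implementation).

from dataclasses import dataclass, field

@dataclass
class Track:
    track_id: int
    boxes: list = field(default_factory=list)  # per-frame (x1, y1, x2, y2)

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def associate(tracks, detections, thresh=0.3):
    """Greedy IoU association: extend the best-matching track, else start a new one."""
    next_id = max((t.track_id for t in tracks), default=-1) + 1
    for det in detections:
        best = max(tracks, key=lambda t: iou(t.boxes[-1], det), default=None)
        if best is not None and iou(best.boxes[-1], det) >= thresh:
            best.boxes.append(det)
        else:
            tracks.append(Track(next_id, [det]))
            next_id += 1
    return tracks
```

The resulting identity-consistent trajectories are what a fusion module could aggregate into features for the LLM; the greedy matching here is the simplest possible association rule, chosen for brevity rather than fidelity to the paper.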
Problem

Research questions and friction points this paper is trying to address.

Multi-Object Tracking
Semantic Understanding
Cognitive Reasoning
Multimodal Large Language Models
Geometric Perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Multi-Object Tracking
Multimodal Large Language Model
Spatio-Temporal Fusion
Progressive Training Strategy
LLMTrack
Pan Liao
Northwestern Polytechnical University, China
Feng Yang
Northwestern Polytechnical University, China
Di Wu
Northwestern Polytechnical University, China
Jinwen Yu
Northwestern Polytechnical University, China
Yuhua Zhu
Postdoctoral Fellow, Stanford University
applied and computational mathematics, kinetic equations, reinforcement learning
Wenhui Zhao
Northwestern Polytechnical University, China