Vision-Motion-Reference Alignment for Referring Multi-Object Tracking via Multi-Modal Large Language Models

📅 2025-11-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Referring Multi-Object Tracking (RMOT) benchmarks provide only static linguistic descriptions, such as appearance, relative position, and initial motion state, and fail to capture dynamic motion evolution (e.g., velocity and directional changes). This causes severe temporal misalignment between the language and vision modalities and hinders cross-modal tracking. To address this, the authors propose the first MLLM-based RMOT framework, built on a unified vision-motion-language alignment paradigm: motion is explicitly modeled as a dedicated modality, a hierarchical cross-modal alignment module aligns visual queries with motion and reference cues, and a motion-guided prediction head strengthens trajectory modeling. On multiple RMOT benchmarks the method significantly outperforms state-of-the-art approaches, with particularly strong accuracy and robustness in complex dynamic scenes.

📝 Abstract
Referring Multi-Object Tracking (RMOT) extends conventional multi-object tracking (MOT) by introducing natural language references for multi-modal fusion tracking. Existing RMOT benchmarks, however, only describe the object's appearance, relative positions, and initial motion states. Such static references fail to capture dynamic changes in object motion, including velocity changes and motion direction shifts. This limitation not only causes a temporal discrepancy between static references and the dynamic vision modality but also constrains multi-modal tracking performance. To address it, we propose a novel Vision-Motion-Reference aligned RMOT framework, named VMRMOT. It integrates a motion modality extracted from object dynamics to enhance the alignment between the vision modality and language references through multi-modal large language models (MLLMs). Specifically, we introduce motion-aware descriptions derived from object dynamic behaviors and, leveraging the powerful temporal-reasoning capabilities of MLLMs, extract motion features as the motion modality. We further design a Vision-Motion-Reference Alignment (VMRA) module to hierarchically align visual queries with motion and reference cues, enhancing their cross-modal consistency. In addition, a Motion-Guided Prediction Head (MGPH) is developed to exploit the motion modality and enhance the prediction head. To the best of our knowledge, VMRMOT is the first approach to employ MLLMs in the RMOT task for vision-reference alignment. Extensive experiments on multiple RMOT benchmarks demonstrate that VMRMOT outperforms existing state-of-the-art methods.
Problem

Research questions and friction points this paper is trying to address.

Addresses temporal discrepancy between static language references and dynamic visual data
Enhances multi-modal alignment through motion-aware descriptions and MLLMs
Improves tracking performance by integrating object motion dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates motion modality from object dynamics
Aligns visual queries with motion and reference cues
Uses motion-guided prediction head to enhance tracking
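The summary above describes a hierarchical alignment in which visual queries are first grounded in motion cues and then aligned with the language reference. The paper's actual architecture is not given here, so the following is only a minimal toy sketch of that two-stage cross-attention idea; all names (`vmra_align`, `cross_attention`), dimensions, and the residual-update form are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # Single-head, projection-free attention: each query row attends
    # over the rows of `context` (used here as both keys and values).
    d = queries.shape[-1]
    attn = softmax(queries @ context.T / np.sqrt(d))
    return attn @ context

def vmra_align(visual_queries, motion_feats, reference_feats):
    # Stage 1 (assumed): ground visual queries in motion cues.
    motion_aware = visual_queries + cross_attention(visual_queries, motion_feats)
    # Stage 2 (assumed): align the motion-aware queries with the reference.
    return motion_aware + cross_attention(motion_aware, reference_feats)

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 64))   # 5 track queries, feature dim 64
m = rng.standard_normal((10, 64))  # 10 motion tokens (e.g., from an MLLM)
r = rng.standard_normal((7, 64))   # 7 reference (language) tokens
out = vmra_align(q, m, r)
print(out.shape)  # (5, 64): one aligned embedding per track query
```

The output keeps one embedding per track query, which a downstream prediction head could then consume; in the real model each stage would use learned projections and multiple heads.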
Weiyi Lv (Shanghai University)
Ning Zhang (PAII Inc.)
Hanyang Sun (Shanghai University)
Haoran Jiang (Shanghai University)
Kai Zhao (Shanghai University)
Jing Xiao (PAII Inc.)
Dan Zeng (Sun Yat-sen University)
Biometrics · computer vision · deep learning