VPTracker: Global Vision-Language Tracking via Visual Prompt and MLLM

📅 2025-12-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing vision-language tracking methods rely on local search and are therefore fragile under drastic viewpoint changes, severe occlusion, and rapid motion. This paper proposes the first multimodal large language model (MLLM)-based global vision-language tracking framework, which overcomes the limits of local window search by localizing the target across the entire image under joint guidance from a visual template and a textual description. The core innovation is a location-aware visual prompting mechanism in which spatial priors steer hierarchical MLLM reasoning: the model first prioritizes candidate regions near the previous location and triggers a global semantic search only when necessary. Combined with region-level visual prompt engineering and joint vision-language feature alignment, the method significantly improves tracking stability and target disambiguation under challenging conditions, including heavy occlusion, large viewpoint variations, and high-speed motion. This work establishes a new paradigm for MLLM-driven general-purpose visual tracking.

๐Ÿ“ Abstract
Vision-Language Tracking aims to continuously localize objects described by a visual template and a language description. Existing methods, however, are typically limited to local search, making them prone to failures under viewpoint changes, occlusions, and rapid target movements. In this work, we introduce the first global tracking framework based on Multimodal Large Language Models (VPTracker), exploiting their powerful semantic reasoning to locate targets across the entire image space. While global search improves robustness and reduces drift, it also introduces distractions from visually or semantically similar objects. To address this, we propose a location-aware visual prompting mechanism that incorporates spatial priors into the MLLM. Specifically, we construct a region-level prompt based on the target's previous location, enabling the model to prioritize region-level recognition and resort to global inference only when necessary. This design retains the advantages of global tracking while effectively suppressing interference from distracting visual content. Extensive experiments show that our approach significantly enhances tracking stability and target disambiguation under challenging scenarios, opening a new avenue for integrating MLLMs into visual tracking. Code is available at https://github.com/jcwang0602/VPTracker.
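The region-first, global-fallback loop described in the abstract can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation (which is in the linked repository); `mllm_locate` is a hypothetical placeholder for a multimodal-LLM grounding call, and the confidence threshold is an assumed parameter.

```python
def mllm_locate(image, description, search_box=None):
    """Hypothetical placeholder: an MLLM grounding call that returns
    (bbox, confidence); search_box=None means search the whole image."""
    raise NotImplementedError

def track_frame(image, description, prev_box, mllm=mllm_locate, conf_thresh=0.5):
    """Region-first search guided by the previous location, with a
    global-search fallback when the regional result is unreliable."""
    # Stage 1: restrict attention to the region prompt around the last box.
    box, conf = mllm(image, description, search_box=prev_box)
    if box is not None and conf >= conf_thresh:
        return box
    # Stage 2: target lost or ambiguous -> global semantic search.
    box, _ = mllm(image, description, search_box=None)
    return box
```

The design point is that the spatial prior suppresses distractors during normal tracking, while the global stage restores the target after occlusion or fast motion.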
Problem

Research questions and friction points this paper is trying to address.

Local-search trackers fail under drastic viewpoint changes, severe occlusion, and rapid target motion
Global search introduces distractors from visually or semantically similar objects
Preserving tracking stability and target disambiguation in these challenging scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Global tracking framework using Multimodal Large Language Models
Location-aware visual prompting with spatial priors
Region-level recognition prioritized over global inference
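The region-level prompt in the second bullet is built from the target's previous box. A minimal sketch of one plausible construction, expanding the prior box by a margin and clamping it to the image, is below; the margin value and clamping behavior are illustrative assumptions, not the paper's exact settings.

```python
def region_prompt_box(prev_box, img_w, img_h, margin=0.5):
    """Expand (x1, y1, x2, y2) by `margin` of its width/height on each side,
    clamped to the image bounds, yielding the region-level search window."""
    x1, y1, x2, y2 = prev_box
    dx = (x2 - x1) * margin
    dy = (y2 - y1) * margin
    return (max(0, x1 - dx), max(0, y1 - dy),
            min(img_w, x2 + dx), min(img_h, y2 + dy))
```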
Authors
Jingchao Wang
East China Normal University
Kaiwen Zhou
School of Data Science and Engineering, East China Normal University, Shanghai, China
Zhijian Wu
Medical Artificial Intelligence Laboratory, Westlake University, Hangzhou, China
Kunhua Ji
School of Data Science and Engineering, East China Normal University, Shanghai, China
Dingjiang Huang
School of Data Science and Engineering, East China Normal University, Shanghai, China
Yefeng Zheng
Professor, Westlake University, Hangzhou, China, IEEE Fellow, AIMBE Fellow
AI in Health · Medical Imaging · Computer Vision · Natural Language Processing · Large Language Model