ReferGPT: Towards Zero-Shot Referring Multi-Object Tracking

📅 2025-04-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses open-vocabulary, zero-shot referring multi-object tracking from textual queries. It proposes a training-free pipeline: a multimodal large language model (MLLM) generates 3D-aware image descriptions; CLIP extracts cross-modal semantic embeddings; and a lightweight fuzzy matching mechanism aligns natural-language queries with the generated descriptions. The key contribution is the first integration of zero-shot vision-language generation and spatially aware description into referring tracking, eliminating reliance on annotated data or predefined category priors and enabling arbitrary open-vocabulary queries. On the Refer-KITTI benchmark suite, the method achieves performance competitive with fully supervised approaches while generalizing substantially better to unseen objects and novel queries. This establishes a new paradigm for open-set vision-language understanding in real-world applications such as autonomous driving.

📝 Abstract
Tracking multiple objects based on textual queries is a challenging task that requires linking language understanding with object association across frames. Previous works typically train the whole process end-to-end or integrate an additional referring text module into a multi-object tracker, but both approaches require supervised training and potentially struggle to generalize to open-set queries. In this work, we introduce ReferGPT, a novel zero-shot referring multi-object tracking framework. We provide a multi-modal large language model (MLLM) with spatial knowledge, enabling it to generate 3D-aware captions. This enhances its descriptive capabilities and supports a more flexible referring vocabulary without training. We also propose a robust query-matching strategy, leveraging CLIP-based semantic encoding and fuzzy matching to associate MLLM-generated captions with user queries. Extensive experiments on Refer-KITTI, Refer-KITTIv2 and Refer-KITTI+ demonstrate that ReferGPT achieves competitive performance against trained methods, showcasing its robustness and zero-shot capabilities in autonomous driving. The code is available at https://github.com/Tzoulio/ReferGPT
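The query-matching strategy described in the abstract blends CLIP-based semantic similarity with fuzzy string matching over MLLM-generated captions. The sketch below illustrates that blending idea only, with stand-ins: a bag-of-words cosine replaces CLIP text embeddings, `difflib.SequenceMatcher` replaces the paper's fuzzy matcher, and the weight `alpha`, the helper names, and the example captions are assumptions for illustration, not values or code from the paper.

```python
from difflib import SequenceMatcher
from collections import Counter
import math

def fuzzy_score(query: str, caption: str) -> float:
    # Character-level fuzzy similarity in [0, 1]; stand-in for the
    # paper's fuzzy matching component.
    return SequenceMatcher(None, query.lower(), caption.lower()).ratio()

def semantic_score(query: str, caption: str) -> float:
    # Bag-of-words cosine similarity; a crude placeholder for the
    # cosine similarity between CLIP text embeddings.
    a, b = Counter(query.lower().split()), Counter(caption.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_score(query: str, caption: str, alpha: float = 0.5) -> float:
    # Weighted blend of semantic and fuzzy similarity; alpha is an
    # assumed knob, not a value reported in the paper.
    return alpha * semantic_score(query, caption) + (1 - alpha) * fuzzy_score(query, caption)

# Hypothetical per-track captions, as an MLLM might produce them.
captions = {
    1: "a red car moving to the left in the near lane",
    2: "a pedestrian standing on the sidewalk",
}
query = "the red car on the left"

# Pick the track whose caption best matches the user query.
best_track = max(captions, key=lambda tid: match_score(query, captions[tid]))
```

In this toy example the blend favors track 1, since both the word overlap and the character-level match with the query are higher than for the pedestrian caption.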
Problem

Research questions and friction points this paper is trying to address.

Tracking objects using text without prior training
Linking language understanding to object association
Generalizing to open-set textual queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot referring tracking with MLLM
3D-aware captions enhance descriptive flexibility
CLIP-based semantic encoding for query matching
Tzoulio Chamiti
ETRO Department, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussels, Belgium; imec, Kapeldreef 75, B-3001 Leuven, Belgium
Leandro Di Bella
PhD student, Vrije Universiteit Brussel
Artificial Intelligence
Adrian Munteanu
Professor, ETRO department, Vrije Universiteit Brussel
Data Compression, Multimodal Signal Processing, Human Biometrics
N. Deligiannis
ETRO Department, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussels, Belgium; imec, Kapeldreef 75, B-3001 Leuven, Belgium