🤖 AI Summary
This work addresses open-vocabulary, zero-shot text-referred multi-object tracking. We propose a fully unsupervised paradigm: leveraging a multimodal large language model (MLLM) to generate 3D-aware image descriptions; extracting cross-modal semantic embeddings via CLIP; and designing a lightweight fuzzy matching mechanism to align natural language queries with visual descriptions. Our key contribution is the first integration of zero-shot vision-language generation and spatially aware description into referring tracking, eliminating reliance on annotated data or predefined category priors and enabling arbitrary open-vocabulary queries. On the Refer-KITTI benchmark suite, our method achieves performance competitive with fully supervised approaches, while significantly improving generalization to unseen objects and novel queries. This establishes a new paradigm for open-set vision-language understanding in real-world applications such as autonomous driving.
📝 Abstract
Tracking multiple objects based on textual queries is a challenging task that requires linking language understanding with object association across frames. Previous works typically train the whole process end-to-end or integrate an additional referring-text module into a multi-object tracker, but both require supervised training and can struggle to generalize to open-set queries. In this work, we introduce ReferGPT, a novel zero-shot referring multi-object tracking framework. We provide a multi-modal large language model (MLLM) with spatial knowledge, enabling it to generate 3D-aware captions. This enhances its descriptive capabilities and supports a more flexible referring vocabulary without training. We also propose a robust query-matching strategy, leveraging CLIP-based semantic encoding and fuzzy matching to associate MLLM-generated captions with user queries. Extensive experiments on Refer-KITTI, Refer-KITTIv2, and Refer-KITTI+ demonstrate that ReferGPT achieves competitive performance against trained methods, showcasing its robustness and zero-shot capabilities in autonomous driving. The code is available at https://github.com/Tzoulio/ReferGPT
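The query-matching strategy described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the combination weight `alpha`, the `threshold`, and the toy embeddings are hypothetical, and `SequenceMatcher` stands in for whatever fuzzy matcher the authors use; in practice the embeddings would come from a CLIP text encoder.

```python
from difflib import SequenceMatcher

def fuzzy_score(query: str, caption: str) -> float:
    # Character-level fuzzy similarity in [0, 1]
    # (stand-in for the paper's fuzzy matching mechanism).
    return SequenceMatcher(None, query.lower(), caption.lower()).ratio()

def semantic_score(query_emb, caption_emb) -> float:
    # Cosine similarity between two embedding vectors
    # (in practice these would come from a CLIP text encoder).
    dot = sum(a * b for a, b in zip(query_emb, caption_emb))
    nq = sum(a * a for a in query_emb) ** 0.5
    nc = sum(b * b for b in caption_emb) ** 0.5
    return dot / (nq * nc)

def match(query, captions, query_emb, caption_embs, alpha=0.5, threshold=0.6):
    # Return indices of MLLM captions whose combined semantic + fuzzy
    # score against the user query passes the threshold.
    matches = []
    for i, (cap, emb) in enumerate(zip(captions, caption_embs)):
        score = (alpha * semantic_score(query_emb, emb)
                 + (1 - alpha) * fuzzy_score(query, cap))
        if score >= threshold:
            matches.append(i)
    return matches
```

For example, a query like "red car moving left" would match a caption "the red car moving left" on both the semantic and string level, while an unrelated caption falls below the threshold.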