AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios

๐Ÿ“… 2025-11-25
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
Existing referring multi-object tracking (RMOT) research focuses predominantly on ground-level scenes, failing to address the semantic understanding and long-range tracking requirements inherent to wide-area aerial perspectives from unmanned aerial vehicles (UAVs). This work bridges that gap by introducing AerialMindโ€”the first large-scale RMOT benchmark specifically designed for UAV scenarios. We propose COALA, a semi-automatic annotation framework that substantially reduces the cost of aligning multiple objects with natural language expressions. Furthermore, we design HawkEyeTrack, a novel method leveraging vision-language collaborative representation learning, cross-modal feature alignment, and spatiotemporal context modeling to enhance instruction-driven detection and tracking. Experiments demonstrate that AerialMind poses significant challenges, and HawkEyeTrack achieves substantial improvements over state-of-the-art baselines on natural language-guided multi-object tracking. Collectively, this work establishes a critical data foundation and technical framework for embodied intelligent UAV systems.

๐Ÿ“ Abstract
Referring Multi-Object Tracking (RMOT) aims to achieve precise object detection and tracking through natural language instructions, representing a fundamental capability for intelligent robotic systems. However, current RMOT research remains largely confined to ground-level scenarios, which limits its ability to capture broad-scale scene context and to support comprehensive tracking and path planning. In contrast, Unmanned Aerial Vehicles (UAVs) leverage their expansive aerial perspectives and superior maneuverability to enable wide-area surveillance. Moreover, UAVs have emerged as critical platforms for Embodied Intelligence, giving rise to an unprecedented demand for intelligent aerial systems capable of natural language interaction. To this end, we introduce AerialMind, the first large-scale RMOT benchmark in UAV scenarios, which aims to bridge this research gap. To facilitate its construction, we develop an innovative semi-automated collaborative agent-based labeling assistant (COALA) framework that significantly reduces labor costs while maintaining annotation quality. Furthermore, we propose HawkEyeTrack (HETrack), a novel method that collaboratively enhances vision-language representation learning and improves the perception of UAV scenarios. Comprehensive experiments validate the challenging nature of our dataset and the effectiveness of our method.
Problem

Research questions and friction points this paper is trying to address.

Extending referring multi-object tracking from ground to aerial UAV scenarios
Addressing the lack of large-scale benchmarks for natural language interaction in UAVs
Developing efficient labeling and vision-language methods for UAV tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-automated agent-based labeling framework COALA
HawkEyeTrack method enhances vision-language representation
Improves UAV scenario perception and tracking
Chenglizhao Chen
Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China)
Shaofeng Liang
Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China)
Runwei Guan
Hong Kong University of Science and Technology (Guangzhou) / Founder of FertiTech AI
Multi-Modal Learning · Unmanned Surface Vessel · Radar Perception · AI Medicine
Xiaolou Sun
Purple Mountain Laboratories
Haocheng Zhao
Xi'an Jiaotong Liverpool University
Neural Networks · Neural Network Pruning · Radar-Camera Fusion
Haiyun Jiang
Associate Professor, Shanghai Jiao Tong University
(Multimodal) Large Model · Intelligent Target Recognition · Knowledge Graph
Tao Huang
College of Science and Engineering, James Cook University
Henghui Ding
Fudan University
Computer Vision · Machine Learning · Segmentation · AIGC
Qing-Long Han
School of Engineering, Swinburne University of Technology, Melbourne