🤖 AI Summary
Existing Target Person Tracking (TPT) benchmarks are largely confined to controlled laboratory settings, limiting their applicability to long-term, robust tracking by robots in crowded, unstructured environments. To address this gap, we introduce RoboTPT, the first large-scale, robot-centric TPT benchmark designed for egocentric (first-person) vision. RoboTPT spans diverse indoor and outdoor scenarios, enabling evaluation of long-term tracking, robustness to frequent occlusions, and re-identification of the target among other pedestrians. Its data is collected via a novel human-in-the-loop cart-following paradigm that synchronously captures multi-modal sensor streams: 3D LiDAR, RGB-D, 360° panoramic imagery, IMU, and wheel odometry. Annotations follow a behavior-guided protocol and provide frame-level, fine-grained 2D bounding boxes across full sequences. We systematically evaluate state-of-the-art TPT methods on RoboTPT, uncovering critical failure modes in dynamic, cluttered settings. RoboTPT thus establishes a reproducible, high-fidelity benchmark to advance Embodied AI and Human-Robot Interaction (HRI).
📝 Abstract
Tracking a target person from robot-egocentric views is crucial for developing autonomous robots that provide continuous, personalized assistance or collaboration in Human-Robot Interaction (HRI) and Embodied AI. However, most existing target person tracking (TPT) benchmarks are limited to controlled laboratory environments with few distractors, clean backgrounds, and only short-term occlusions. In this paper, we introduce a large-scale dataset designed for TPT in crowded and unstructured environments, demonstrated through a robot-person following task. The dataset is collected by a human pushing a sensor-equipped cart while following a target person, capturing human-like following behavior and emphasizing long-term tracking challenges, including frequent occlusions and the need to re-identify the target among numerous pedestrians. It provides multi-modal data streams, including odometry, 3D LiDAR, IMU, panoramic, and RGB-D images, along with exhaustively annotated 2D bounding boxes of the target person across 35 sequences recorded both indoors and outdoors. Using this dataset and its visual annotations, we conduct extensive experiments with existing TPT methods, offering a thorough analysis of their limitations and suggesting directions for future research.
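To make the frame-level annotation idea concrete, below is a minimal Python sketch of loading one sequence's per-frame target boxes and counting full-occlusion frames. The path `seq_01/target_boxes.json` and the record schema (`frame_id`, `bbox` as `[x, y, w, h]`, `visible`) are illustrative assumptions for this sketch, not the dataset's published file layout.

```python
# Sketch: consume per-frame 2D target boxes for one sequence.
# NOTE: the JSON schema and file path below are assumed for illustration;
# they are not the dataset's documented format.
from __future__ import annotations

import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TargetAnnotation:
    """One frame-level annotation of the target person (assumed schema)."""
    frame_id: int                 # index into the synchronized image stream
    bbox: tuple[int, ...] | None  # (x, y, w, h) in pixels; None when fully occluded
    visible: bool                 # False during full occlusions


def load_sequence_annotations(path: Path) -> list[TargetAnnotation]:
    """Parse one sequence's per-frame target boxes from the assumed JSON layout."""
    records = json.loads(path.read_text())
    return [
        TargetAnnotation(
            frame_id=r["frame_id"],
            bbox=tuple(r["bbox"]) if r.get("bbox") else None,
            visible=bool(r.get("visible", r.get("bbox") is not None)),
        )
        for r in records
    ]


if __name__ == "__main__":
    # Placeholder path; substitute the actual annotation file for a sequence.
    annotations = load_sequence_annotations(Path("seq_01/target_boxes.json"))
    occluded = sum(1 for a in annotations if not a.visible)
    print(f"{len(annotations)} frames, {occluded} fully occluded")
```

Because the target is annotated on every frame of a sequence rather than only while visible, occlusion gaps like the one counted above can be measured directly, which is what makes evaluating long-term re-identification after the target disappears possible.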