TPT-Bench: A Large-Scale, Long-Term and Robot-Egocentric Dataset for Benchmarking Target Person Tracking

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Target Person Tracking (TPT) benchmarks are largely confined to controlled laboratory settings, limiting their applicability to long-term, robust tracking by robots in crowded, unstructured environments. To address this gap, we introduce TPT-Bench—the first large-scale, robot-centric TPT benchmark designed for egocentric (first-person) vision. TPT-Bench encompasses diverse indoor and outdoor scenarios, enabling evaluation of long-term tracking, frequent occlusions, and cross-pedestrian re-identification. Its data is collected via a novel human-in-the-loop cart-following paradigm, synchronously capturing multi-modal sensor streams—including 3D LiDAR, RGB-D, 360° panoramic imagery, IMU, and wheel odometry. Annotations follow a behavior-guided protocol and include frame-level, fine-grained 2D bounding boxes across full sequences. We systematically evaluate state-of-the-art TPT methods on TPT-Bench, uncovering critical failure modes in dynamic, cluttered settings. TPT-Bench thus establishes a reproducible, high-fidelity benchmark to advance Embodied AI and Human-Robot Interaction (HRI).

📝 Abstract
Tracking a target person from robot-egocentric views is crucial for developing autonomous robots that provide continuous personalized assistance or collaboration in Human-Robot Interaction (HRI) and Embodied AI. However, most existing target person tracking (TPT) benchmarks are limited to controlled laboratory environments with few distractions, clean backgrounds, and short-term occlusions. In this paper, we introduce a large-scale dataset designed for TPT in crowded and unstructured environments, demonstrated through a robot-person following task. The dataset is collected by a human pushing a sensor-equipped cart while following a target person, capturing human-like following behavior and emphasizing long-term tracking challenges, including frequent occlusions and the need for re-identification among numerous pedestrians. It provides multi-modal data streams—odometry, 3D LiDAR, IMU, panoramic, and RGB-D images—along with exhaustively annotated 2D bounding boxes of the target person across 35 sequences, both indoors and outdoors. Using this dataset and its visual annotations, we perform extensive experiments with existing TPT methods, offering a thorough analysis of their limitations and suggesting future research directions.
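Since the benchmark scores trackers against frame-level 2D bounding boxes, evaluation typically reduces to overlap-based metrics such as IoU success rate. A minimal sketch of that metric (function names and box format are illustrative assumptions, not the paper's actual evaluation code):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def success_rate(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of frames where the predicted box overlaps the
    ground-truth target box by at least the given IoU threshold."""
    hits = sum(1 for p, g in zip(pred_boxes, gt_boxes)
               if iou(p, g) >= threshold)
    return hits / len(gt_boxes)
```

Long-term TPT sequences with occlusions and re-identification failures show up in this metric as runs of frames where the predicted box misses the annotated target, pulling the success rate down.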
Problem

Research questions and friction points this paper is trying to address.

Tracking a target person in crowded, unstructured environments from robot-egocentric views
Handling long-term tracking challenges such as occlusions and re-identification
Benchmarking existing methods with multi-modal data for HRI applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset for crowded environments
Multi-modal data including LiDAR and RGB-D
Focus on long-term tracking and re-identification
Hanjing Ye
PhD Student at Southern University of Science and Technology
Robot Person Following, Place Recognition
Yu Zhan
Southern University of Science and Technology
robot person following, human pose estimation, omnidirectional image
Weixi Situ
Southern University of Science and Technology
Guangcheng Chen
Southern University of Science and Technology
Polarimetric imaging, 3D reconstruction
Jingwen Yu
Southern University of Science and Technology
Kuanqi Cai
Italian Institute of Technology
Hong Zhang
Southern University of Science and Technology