🤖 AI Summary
Existing Target Person Tracking (TPT) benchmarks are largely confined to controlled laboratory settings, limiting their applicability to long-term, robust tracking by robots in crowded, unstructured environments. To address this gap, we introduce RoboTPT, the first large-scale, robot-centric TPT benchmark designed for egocentric (first-person) vision. RoboTPT spans diverse indoor and outdoor scenarios, enabling evaluation of long-term tracking, robustness to frequent occlusions, and re-identification of the target among other pedestrians. Its data is collected via a novel human-in-the-loop cart-following paradigm that synchronously captures multi-modal sensor streams: 3D LiDAR, RGB-D, 360° panoramic imagery, IMU, and wheel odometry. Annotations follow a behavior-guided protocol and provide frame-level, fine-grained 2D bounding boxes across full sequences. We systematically evaluate state-of-the-art TPT methods on RoboTPT, uncovering critical failure modes in dynamic, cluttered settings. RoboTPT thus establishes a reproducible, high-fidelity benchmark to advance Embodied AI and Human-Robot Interaction (HRI).
📝 Abstract
Tracking a target person from robot-egocentric views is crucial for developing autonomous robots that provide continuous, personalized assistance or collaboration in Human-Robot Interaction (HRI) and Embodied AI. However, most existing target person tracking (TPT) benchmarks are limited to controlled laboratory environments with few distractors, clean backgrounds, and only short-term occlusions. In this paper, we introduce a large-scale dataset designed for TPT in crowded and unstructured environments, demonstrated through a robot-person following task. The dataset is collected by a human pushing a sensor-equipped cart while following a target person, capturing human-like following behavior and emphasizing long-term tracking challenges, including frequent occlusions and the need to re-identify the target among numerous pedestrians. It provides multi-modal data streams, including odometry, 3D LiDAR, IMU, panoramic, and RGB-D images, along with exhaustively annotated 2D bounding boxes of the target person across 35 sequences recorded both indoors and outdoors. Using this dataset and its visual annotations, we conduct extensive experiments with existing TPT methods, offering a thorough analysis of their limitations and suggesting directions for future research.
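To make the frame-level annotation idea concrete, below is a minimal Python sketch of loading one sequence's per-frame target boxes and counting full-occlusion frames. The path `seq_01/target_boxes.json` and the record schema (`frame_id`, `bbox` as `[x, y, w, h]`, `visible`) are illustrative assumptions for this sketch, not the dataset's published file layout.

```python
# Sketch: consume per-frame 2D target boxes for one sequence.
# NOTE: the JSON schema and file path below are assumed for illustration;
# they are not the dataset's documented format.
from __future__ import annotations

import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TargetAnnotation:
    """One frame-level annotation of the target person (assumed schema)."""
    frame_id: int                 # index into the synchronized image stream
    bbox: tuple[int, ...] | None  # (x, y, w, h) in pixels; None when fully occluded
    visible: bool                 # False during full occlusions


def load_sequence_annotations(path: Path) -> list[TargetAnnotation]:
    """Parse one sequence's per-frame target boxes from the assumed JSON layout."""
    records = json.loads(path.read_text())
    return [
        TargetAnnotation(
            frame_id=r["frame_id"],
            bbox=tuple(r["bbox"]) if r.get("bbox") else None,
            visible=bool(r.get("visible", r.get("bbox") is not None)),
        )
        for r in records
    ]


if __name__ == "__main__":
    # Placeholder path; substitute the actual annotation file for a sequence.
    annotations = load_sequence_annotations(Path("seq_01/target_boxes.json"))
    occluded = sum(1 for a in annotations if not a.visible)
    print(f"{len(annotations)} frames, {occluded} fully occluded")
```

Because the target is annotated on every frame of a sequence rather than only while visible, occlusion gaps like the one counted above can be measured directly, which is what makes evaluating long-term re-identification after the target disappears possible.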