🤖 AI Summary
In crowded indoor environments, escort robots often suffer frequent tracking interruptions and recover from them poorly, because identifying the escorted person and recognizing their actions in real time is difficult. To address this, we propose an end-to-end online joint detection framework that unifies person re-identification and action prediction within a single neural network architecture, the first of its kind. Our method integrates spatiotemporal feature extraction, lightweight online action detection, and identity matching, enabling simultaneous human detection, persistent identity tracking, and fine-grained action classification (e.g., stopping, distraction, obstruction). Evaluated on a newly constructed escort-scenario action dataset and on public benchmarks, our model significantly outperforms state-of-the-art baselines, particularly in rapid reacquisition and responsive recovery after tracking loss. The results demonstrate superior real-time performance, robustness, and practical applicability in complex, dynamic indoor settings.
📝 Abstract
The deployment of robot assistants in large indoor spaces has grown significantly, with escorting becoming a key application. However, most current escorting robots rely primarily on navigation-focused strategies, assuming that the person being escorted will follow without issue. In crowded environments, this assumption often fails: individuals may struggle to keep pace, become obstructed, get distracted, or need to stop unexpectedly. As a result, conventional robotic systems often cannot provide effective escorting services because of their limited understanding of human movement dynamics. To address these challenges, an effective escorting robot must continuously detect and interpret human actions during the escorting process and adjust its movement accordingly. Yet no existing dataset is designed specifically for human action detection in the context of escorting. Because escorting often occurs in crowded environments, where other individuals may enter the robot's camera view, the robot also needs to identify the specific human it is escorting (the subject) before predicting their actions. Since no existing model performs both person re-identification and action prediction in real time, we propose a novel neural network architecture that accomplishes both tasks. This enables the robot to adjust its speed dynamically based on the escortee's movements and to resume escorting seamlessly after any disruption. In comparative evaluations against strong baselines, our system demonstrates superior efficiency and effectiveness, showing its potential to significantly improve robotic escorting services in complex, real-world scenarios.
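As a rough illustration of the control loop the abstract describes (identify the subject among the people in view, read their predicted action, then adjust speed), the following Python sketch may help. It is not the paper's network or controller. It assumes a hypothetical upstream model that already outputs one appearance embedding and one action label per detected person. The label names, the speed factors, the cosine-similarity matching, and the 0.8 threshold are all illustrative assumptions.

```python
import math
from dataclasses import dataclass

# Hypothetical action labels, loosely based on the behaviours named in the
# abstract; the speed factors are illustrative assumptions, not reported values.
SPEED_FACTOR = {
    "following": 1.0,   # escortee keeps pace: move at nominal speed
    "lagging": 0.5,     # escortee struggles to keep pace: slow down
    "obstructed": 0.0,  # escortee is blocked: wait
    "distracted": 0.0,  # escortee's attention is elsewhere: wait
    "stopped": 0.0,     # escortee stopped unexpectedly: wait
}


@dataclass
class Detection:
    """One person in the camera view, as produced by an assumed upstream model."""
    embedding: list  # appearance feature used for re-identification
    action: str      # predicted action label for this person


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_subject(detections, subject_embedding, threshold=0.8):
    """Return the detection that best matches the subject, or None if lost."""
    best = max(
        detections,
        key=lambda d: cosine(d.embedding, subject_embedding),
        default=None,
    )
    if best is None or cosine(best.embedding, subject_embedding) < threshold:
        return None
    return best


def speed_command(detections, subject_embedding, nominal_speed=1.0):
    """Map the subject's predicted action to a forward speed."""
    subject = find_subject(detections, subject_embedding)
    if subject is None:
        return 0.0  # subject not in view: stop until they are re-identified
    # Unknown labels fall back to waiting, the conservative choice.
    return nominal_speed * SPEED_FACTOR.get(subject.action, 0.0)
```

In this sketch a bystander who happens to be walking does not influence the robot, because only the detection matched to the subject's stored embedding drives the speed command. Stopping when no match clears the threshold is one simple way to model waiting for reacquisition after a disruption.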