🤖 AI Summary
To address the challenges of safe, efficient, and socially acceptable navigation for autonomous delivery robots in dense urban pedestrian environments, this paper proposes a monocular vision–driven framework that unifies multi-pedestrian perception and behavioral understanding. Methodologically, it integrates human pose estimation, monocular depth estimation, and multi-object tracking to construct an identity-robust trajectory prediction model, augmented by a vulnerable pedestrian identification mechanism to enhance social awareness. The key contribution lies in leveraging joint pose–depth cues to improve identity maintenance under occlusion and ensure long-term trajectory consistency. Evaluated on the MOT17 benchmark, the approach achieves a 10% gain in IDF1 and a 7% improvement in MOTA, while maintaining detection precision above 85%. These results demonstrate significantly enhanced navigation reliability and socially inclusive interaction capability in high-density, heavily occluded scenarios.
📝 Abstract
The integration of Automated Delivery Robots (ADRs) into pedestrian-heavy urban spaces introduces unique challenges in terms of safe, efficient, and socially acceptable navigation. We develop a complete pipeline, driven by a single vision sensor, for multi-pedestrian detection and tracking, pose estimation, and monocular depth perception. Leveraging real-world MOT17 dataset sequences, this study demonstrates how integrating human pose estimation and depth cues enhances pedestrian trajectory prediction and identity maintenance, even under occlusion and in dense crowds. Results show measurable improvements, including up to a 10% increase in identity preservation (IDF1), a 7% improvement in multi-object tracking accuracy (MOTA), and consistently high detection precision exceeding 85%, even in challenging scenarios. Notably, the system identifies vulnerable pedestrian groups, supporting more socially aware and inclusive robot behaviour.
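For readers unfamiliar with the reported metrics, the standard CLEAR-MOT accuracy (MOTA) and identity F1 (IDF1) scores quoted above can be sketched directly from their definitions. The counts below are illustrative placeholders, not figures from the paper.

```python
def mota(fn: int, fp: int, idsw: int, gt: int) -> float:
    """MOTA = 1 - (FN + FP + IDSW) / GT, where GT is the total number of
    ground-truth object instances across all frames of a sequence."""
    return 1.0 - (fn + fp + idsw) / gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN): the harmonic mean of identity
    precision and identity recall under an optimal ID-to-track assignment."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


# Hypothetical per-sequence counts, for illustration only.
print(f"MOTA: {mota(fn=120, fp=80, idsw=15, gt=2000):.3f}")   # 0.893
print(f"IDF1: {idf1(idtp=1700, idfp=200, idfn=300):.3f}")     # 0.872
```

A 7% MOTA gain therefore reflects fewer misses, false positives, and identity switches combined, while a 10% IDF1 gain specifically captures the longer-lived, occlusion-robust identities the pose–depth cues are designed to preserve.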